Abstract: We study large-scale distributed cooperative systems that use optimistic replication. We
represent a system as a graph of actions (operations) connected by edges that reify semantic con- 
straints between actions. Constraint types include conflict, execution order, dependence, and atom- 
icity. The local state is some schedule that conforms to the constraints; because of conflicts, client 
state is only tentative. For consistency, site schedules should converge; we designed a decentralised, 
asynchronous commitment protocol. Each client makes a proposal, reflecting its tentative and/or 
preferred schedules. Our protocol distributes the proposals, which it decomposes into semantically- 
meaningful units called candidates, and runs an election between comparable candidates. A candi- 
date wins when it receives a majority or a plurality. The protocol is fully asynchronous: each site 
executes its tentative schedule independently, and determines locally when a candidate has won an 
election. The committed schedule is as close as possible to the preferences expressed by clients. 

Key-words: data replication, optimistic replication, semantic replication, commitment, voting 
protocols. 
A commitment protocol for optimistic replication in semantically rich distributed systems

Summary: In this document we examine consistency in distributed systems that replicate data optimistically. The paradigm of optimistic replication is that the sites making up the distributed system may re-execute client requests (actions) if the semantics linking the actions requires it. In such systems, the consistency criterion is that sites eventually converge towards equivalent executions. To ensure this convergence, a commitment protocol is necessary; it is the subject of this study. Our protocol proceeds by successive elections over sets of actions executed optimistically by the system. The semantics taken into account in this protocol is rich enough to express notions such as non-commutativity, conflict, or causality between actions. We prove that our protocol is safe, despite crash faults that may occur at the sites.

Keywords: optimistic replication, commitment, voting protocols
1 Introduction

In a large-scale cooperative system, access to shared data is a performance and availability bottle- 
neck. One solution is optimistic replication (OR), where a process may read or update its local 
replica without synchronising with remote sites [17]. OR decouples data access from network access.

In OR, each site makes progress independently, even while others are slow, currently discon- 
nected, or currently working in isolated mode. OR is well suited to peer-to-peer systems and to 
devices with occasional connectivity. 

Some limited knowledge of semantics provides a lot of extra power and flexibility. Therefore, 
we model the system as a graph, called a multilog, where each vertex represents an action (i.e., an 
operation proposed by some client), and an edge is a semantic relation between vertices, called a 
constraint. Our constraints include conflict, ordered execution, causal dependence, and atomicity. 
Each site has its own multilog, which contains actions submitted by the local client, and their con- 
straints, as well as those received from other sites. The current state is some execution schedule 
that contains actions from the site's multilog, arranged to conform with its constraints. For instance, 
when actions are antagonistic, at least one must abort; an action that depends on an aborted action 
must abort too; non-commutative actions should be scheduled in the same order everywhere, etc. 
The site may choose any conforming schedule, e.g., one that minimises aborts, or one that reflects 
user preferences. 

For consistency, sites should agree on a common, stable and correct schedule. We call this 
agreement commitment. Some cooperative OR systems never commit, such as Roam [16] or Draw-Together [6]. Previous work on commitment for semantic OR, such as Bayou [20] or IceCube [15], centralises the agreement at a central site. Other work decentralises commitment (e.g., Paxos consensus [11]) but ignores semantics. It is difficult to reconcile semantics and decentralisation. One
possible approach would use Paxos to compute a total order, and abort any actions for which this 
order would violate a constraint. However this approach aborts actions unnecessarily. Furthermore, 
the arbitrary total order may be very different from what users expect. 

A better approach is to order only non-commuting pairs of actions, to abort only when actions 
are antagonistic, to minimise dependent aborts, and to remain close to user expectations. We propose an efficient, decentralised protocol that uses semantic information for this purpose. Participating sites make and exchange proposals asynchronously; our algorithm decomposes each one into semantically-meaningful candidates; it runs elections between comparable candidates. A candidate that collects a majority or a plurality wins its election. Voting ensures that the common
schedule is similar to the tentative schedules, minimising user surprise. Our protocol orders only 
non-commuting actions and minimises unnecessary aborts. 

This paper makes several contributions: 

• Our algorithm combines a number of known techniques in a novel manner. 

• We identify the concept of a semantically-meaningful unit for election (which we call a can- 
didate). 
• We propose an efficient commitment protocol system that is both decentralised and semantic-oriented, and that has weak communication requirements.

• We show how to minimise user surprise, the committed schedule being similar to local tenta- 
tive schedules. 

• We prove that the protocol is safe even in the presence of non-byzantine faults. The protocol 
is live as long as a sufficient number of votes are received. 

The outline of this paper follows. Section 2 introduces our system model and our vocabulary. Section 3 discusses an abstraction of classical OR approaches that is later re-used in our algorithm. Section 4 specifies client behaviour. Our commitment protocol is specified in Section 5. Section 6 provides a proof outline and addresses message cost. We compare with related work in Section 7. In conclusion, Section 8 discusses our results and future work.



2 System model 

Following the ACF model [18], an OR system is an asynchronous distributed system of $n$ sites $i, j, \ldots \in \mathcal{J}$. A site that crashes eventually recovers with its identity and persistent memory intact (but may miss some messages in the interval). Clients propose actions (deterministic operations) noted $\alpha, \beta, \ldots \in A$. An action might request, for instance, "Debit 100 euros from bank account number 12345."

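A multilog is a quadruple $M = (K, \to, \triangleleft, \nparallel)$, representing a graph where the vertices $K$ are actions, and $\to$, $\triangleleft$ and $\nparallel$ (pronounced NotAfter, Enables and NonCommuting respectively) are three sets of edges called constraints. We will explain their semantics shortly.¹

We identify a state with a schedule $S$, a sequence of distinct actions ordered by $<_S$, executed from the common initial state INIT. The following safety condition defines the semantics of NotAfter and Enables in relation to schedules. We define $\Sigma(M)$, the set of schedules $S$ that are sound with respect to multilog $M$, as follows:

$$
S \in \Sigma(M) \stackrel{\mathrm{def}}{=} \forall \alpha, \beta \in A,
\begin{cases}
\mathrm{INIT} \in S \\
\alpha \in S \Rightarrow \alpha \in K \\
\alpha \in S \land \alpha \neq \mathrm{INIT} \Rightarrow \mathrm{INIT} <_S \alpha \\
(\alpha \to \beta) \land \alpha, \beta \in S \Rightarrow \alpha <_S \beta \\
(\alpha \triangleleft \beta) \land \beta \in S \Rightarrow \alpha \in S
\end{cases}
$$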



Constraints represent scheduling relations between actions: NotAfter is a (non-transitive) ordering relation and Enables is right-to-left (non-transitive) implication.²
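The soundness condition can be checked mechanically for a given schedule. The sketch below is only an illustration: it assumes that a schedule is a Python list, that NotAfter and Enables are sets of pairs, and that Enables is read as right-to-left implication as stated above. The name `is_sound_schedule` is hypothetical.

```python
INIT = "INIT"

def is_sound_schedule(schedule, actions, not_after, enables):
    """Check a schedule against the safety condition of a multilog.

    schedule  -- list of actions, in execution order
    actions   -- the set K of the multilog
    not_after -- set of pairs (a, b) meaning a NotAfter b
    enables   -- set of pairs (a, b) meaning a Enables b
    """
    position = {a: n for n, a in enumerate(schedule)}
    if len(position) != len(schedule):       # actions must be distinct
        return False
    if position.get(INIT) != 0:              # INIT is present and comes first
        return False
    if any(a not in actions for a in schedule):
        return False
    for a, b in not_after:                   # if both run, a runs before b
        if a in position and b in position and not position[a] < position[b]:
            return False
    for a, b in enables:                     # b scheduled implies a scheduled
        if b in position and a not in position:
            return False
    return True
```

Note that a self-loop such as `("a", "a")` in `not_after` makes every schedule containing `a` unsound, which is how an action is killed later in the paper.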



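¹ Multilog union, inclusion, difference, etc., are defined as component-wise union, inclusion, difference, etc., respectively. For instance, if $M = (K, \to, \triangleleft, \nparallel)$ and $M' = (K', \to', \triangleleft', \nparallel')$, their union is $M \cup M' = (K \cup K', \to \cup \to', \triangleleft \cup \triangleleft', \nparallel \cup \nparallel')$.

² A constraint is a relation in $A \times A$. By abuse of notation, for some relation $\mathcal{R}$, we write equivalently $(\alpha \mathrel{\mathcal{R}} \beta) \in M$ or $\alpha \mathrel{\mathcal{R}_M} \beta$ or $(\alpha, \beta) \in \mathcal{R}_M$. $\nparallel$ is symmetric and $\triangleleft$ is reflexive. They do not have any further special properties; in particular, $\to$ and $\triangleleft$ are not transitive, are not orders, and may be cyclic.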
Constraints represent semantic relations between actions. For instance, consider a database system (more precisely, a serialisable database that transmits transactions by value, such as DBSM [13]). Assume shared variables $x, y, z$ are initially zero. Two concurrent transactions $T_1 = r(x)0; w(z)1$ and $T_2 = w(x)2$ are related by $T_1 \to T_2$, since $T_1$ read a value that precedes $T_2$'s write.³ $T_1$ and $T_3 = r(z)0; w(x)3$ are antagonistic, i.e., one or the other (or both) must abort, as each is NotAfter the other. In the execution $T_1; T_4$ where $T_4 = r(z)1$, the latter transaction depends causally on the former, i.e., they may run only in that order, and $T_4$ aborts if $T_1$ aborts; we write $T_1 \to T_4 \land T_1 \triangleleft T_4$. As another example, Section 6.5 discusses how to encode the semantics of database transactions with constraints.

Non-commutativity imposes a liveness obligation: the system must put a NotAfter between non- 
commuting actions, or abort one of them. (Therefore, non-commutativity does not appear in the 
above safety condition.) The system also has the obligation to resolve antagonisms by aborting 
actions. 

For instance, transactions $T_1$ and $T_5 = r(y)0$ commute if $x$, $y$ and $z$ are independent. In a database system that commits operations (as opposed to committing values), transactions $T_6$ = "Credit 66 euros to Account 12345" and $T_7$ = "Credit 77 euros to Account 12345" commute since addition is a commutative operation, but $T_6$ and $T_8$ = "Debit 88 euros from Account 12345" do not, if bank accounts are not allowed to become negative. We write $T_6 \nparallel T_8$.

Order, antagonism and non-commutativity are collectively called conflicts.⁴

Clients submit actions to their local site; sites exchange actions and constraints asynchronously. The current knowledge of site $i$ at time $t$ is the distinguished site-multilog $M_i(t)$. Initially, $M_i(0) = (\{\mathrm{INIT}\}, \emptyset, \emptyset, \emptyset)$, and it grows over time, as we will explain later. A site's current state is the site-schedule $S_i(t)$, which is some (arbitrary) schedule in $\Sigma(M_i(t))$.

An action executes tentatively only, because of conflicts and related issues. However, an action 
might have sufficient constraints that its execution is stable. We distinguish the following interesting 
subsets of actions relative to M. 

• Guaranteed actions appear in every schedule of $\Sigma(M)$. Formally, $Guar(M)$ is the smallest subset of $K$ satisfying: $\mathrm{INIT} \in Guar(M) \land ((\alpha \in Guar(M) \land \beta \triangleleft \alpha) \Rightarrow \beta \in Guar(M))$.

• Dead actions never appear in a schedule of $\Sigma(M)$. $Dead(M)$ is the smallest subset of $A$ satisfying: $((\alpha_1, \ldots, \alpha_{m \geq 0} \in Guar(M)) \land (\beta \to \alpha_1 \to \ldots \to \alpha_m \to \beta) \Rightarrow \beta \in Dead(M)) \land ((\alpha \in Dead(M) \land \alpha \triangleleft \beta) \Rightarrow \beta \in Dead(M))$.

• Serialised actions are either dead or ordered with respect to all non-commuting constraints. $Serialised(M) \stackrel{\mathrm{def}}{=} \{\alpha \in K \mid \forall \beta \in A, \alpha \nparallel \beta \Rightarrow \alpha \to \beta \lor \beta \to \alpha \lor \beta \in Dead(M) \lor \alpha \in Dead(M)\}$.

• Decided actions are either dead, or both guaranteed and serialised. $Decided(M) \stackrel{\mathrm{def}}{=} Dead(M) \cup (Guar(M) \cap Serialised(M))$.

• Stable (i.e., durable) actions are decided, and all actions that precede them by NotAfter or Enables are themselves stable: $Stable(M) \stackrel{\mathrm{def}}{=} Dead(M) \cup \{\alpha \in Guar(M) \cap Serialised(M) \mid \forall \beta \in A, (\beta \to \alpha \lor \beta \triangleleft \alpha) \Rightarrow \beta \in Stable(M)\}$.

³ $r(x)n$ stands for a read of $x$ returning value $n$, and $w(x)n$ writes the value $n$ into $x$.

⁴ Some authors suggest removing conflicts by transforming the actions [19]. We assume that, if such transformations are possible, they have already been applied.

Figure 1: $\mathcal{A}_{\mathrm{conservative}}(<)$: Applying semantic constraints to a given total order

To decide an action $\alpha$ relative to a multilog $M$ means to add constraints to $M$ such that $\alpha \in Decided(M)$. In particular, to guarantee $\alpha$, we add $\alpha \triangleleft \mathrm{INIT}$ to the multilog, and to kill $\alpha$, we add $\alpha \to \alpha$; to serialise non-commuting actions $\alpha$ and $\beta$, we add either $\alpha \to \beta$, $\beta \to \alpha$, $\alpha \to \alpha$, or $\beta \to \beta$.

Multilog $M$ is said to be sound iff $\Sigma(M) \neq \emptyset$, or equivalently, iff $Dead(M) \cap Guar(M) = \emptyset$. An unsound multilog is definitely broken, i.e., no possible schedule can satisfy all the constraints, not even the empty schedule.

Referring to the standard database terminology, a committed action is one that is both stable and 
guaranteed, and aborted is the same as dead. 

The standard correctness condition in OR systems is Eventual Consistency: if clients stop sub- 
mitting, eventually all sites reach the same state. We extend this definition by not requiring that 
clients stop, by requiring that all states be correct, and by demanding decision. 

Definition 1 (Eventual Consistency.) An OR system is eventually consistent iff it satisfies all the 
following conditions: 

• Local soundness (safety): Every site-schedule is sound: $\forall i, t: S_i(t) \in \Sigma(M_i(t))$

• Mergeability (safety): The union of all the site-multilogs over time is sound: $\Sigma\left(\bigcup_{i,t} M_i(t)\right) \neq \emptyset$

• Eventual propagation (liveness): $\forall i, j \in \mathcal{J}, \forall t, \exists t' : M_i(t) \subseteq M_j(t')$

• Eventual decision (liveness): Every submitted action is eventually decided: $\forall \alpha \in A, \forall i \in \mathcal{J}, \forall t, \exists t' : K_i(t) \subseteq Decided(M_i(t'))$

We assume some form of epidemic communication to fulfill Eventual Propagation. A commit- 
ment algorithm aims to fulfill the obligations of Eventual Decision. Of course, it must also satisfy 
the safety requirements. 
3 Classical OR commitment algorithms

Our proposal builds upon existing commitment algorithms for OR systems. Generally, these either 
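are centralised or do not take constraints into account. We note $\mathcal{A}(M)$ some algorithm that offers decisions based on multilog $M$; with no loss of generality, we focus on the outcome of $\mathcal{A}$ at a single site. Assuming $M$ is sound, and noting the result $M' = \mathcal{A}(M)$, $\mathcal{A}$ must satisfy these requirements:

• $\mathcal{A}$ extends its input: $M \subseteq M'$.

• $\mathcal{A}$ may not add actions: $K' = K$.

• $\mathcal{A}$ may add constraints, which are restricted to decisions:

$$
\begin{aligned}
\alpha \to' \beta &\Rightarrow (\alpha \to \beta) \lor (\alpha \nparallel \beta) \lor (\beta = \alpha) \\
\alpha \triangleleft' \beta &\Rightarrow (\alpha \triangleleft \beta) \lor (\beta = \mathrm{INIT}) \\
\nparallel' &= \nparallel
\end{aligned}
$$

• $M'$ is sound.

• $M'$ is stable: $Stable(M') = K$.

$\mathcal{A}$ could be any algorithm satisfying the requirements.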

One possible algorithm, $\mathcal{A}_{\mathrm{conservative}}(<)$, first orders actions, then kills actions for which the order is unsafe. It proceeds as follows (see Figure 1). Let $<$ be a total order of actions and $M$ a sound multilog. The algorithm decides one action at a time, varying over all actions, left to right; call the current action $\beta$. Consider actions $\alpha$ and $\gamma$ such that $\alpha < \beta < \gamma$: $\alpha$ has already been decided, and $\gamma$ has not. If $\beta \nparallel \gamma$, then serialise them in schedule order. If $\beta \to \alpha$, and $\alpha$ is guaranteed, kill $\beta$, because the schedule and the constraint are incompatible. If $\gamma \triangleleft \beta$, conservatively kill $\beta$, because it is not known whether $\gamma$ can be guaranteed. By definition, if $\alpha \triangleleft \beta$ and $\alpha$ is dead, then $\beta$ is dead. If $\beta$ is not dead by any of the above rules, then decide $\beta$ guaranteed (by adding $\beta \triangleleft \mathrm{INIT}$ to the multilog). The resulting $\Sigma(\mathcal{A}_{\mathrm{conservative}}(<)(M))$ contains a unique schedule.

It should be clear that this approach is safe but tends to kill actions unnecessarily. 
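The left-to-right pass described above can be sketched as follows. This is an illustrative reading of the rules, not the authors' implementation: it assumes constraints are sets of pairs, that the total order is a list starting with INIT, and that every action mentioned in a constraint appears in that list. The function name `conservative` is hypothetical.

```python
INIT = "INIT"

def conservative(order, not_after, enables, non_commuting):
    """Decide every action of `order`; return (schedule, dead actions)."""
    rank = {a: n for n, a in enumerate(order)}
    not_after, enables = set(not_after), set(enables)
    guaranteed, dead = {INIT}, set()
    for beta in order:
        if beta == INIT:
            continue
        # Serialise beta with later non-commuting actions, in schedule order.
        for x, y in non_commuting:
            other = y if x == beta else x if y == beta else None
            if other is not None and rank[other] > rank[beta]:
                not_after.add((beta, other))
        kill = (
            (beta, beta) in not_after                      # already killed
            # beta NotAfter alpha, with alpha earlier and guaranteed
            or any(a == beta and b in guaranteed and rank[b] < rank[beta]
                   for a, b in not_after)
            # gamma Enables beta, with gamma later (not yet decided)
            or any(b == beta and rank[a] > rank[beta] for a, b in enables)
            # alpha Enables beta, with alpha dead
            or any(b == beta and a in dead for a, b in enables)
        )
        if kill:
            dead.add(beta)
            not_after.add((beta, beta))     # kill: beta NotAfter beta
        else:
            guaranteed.add(beta)
            enables.add((beta, INIT))       # guarantee: beta Enables INIT
    schedule = [a for a in order if a in guaranteed]
    return schedule, dead
```

Running it with two different orders over the same constraints shows how much the outcome depends on the order chosen, which is the arbitrariness the text points out.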

The Bayou system [20] applies $\mathcal{A}_{\mathrm{conservative}}(<)$, where $<$ is the order in which actions are received at a single primary site. An action aborts if it fails an application-specific precondition, which we reify as a $\to$ constraint.

In the Last-Writer-Wins (LWW) approach [7], an action (completely overwriting some datum) is stamped with the time it is submitted. Two actions that modify the same datum are related by $\to$ in timestamp order. Sites execute actions in arbitrary order and apply $\mathcal{A}_{\mathrm{conservative}}(<)$. Consequently, a datum has the state of the most recent write (in timestamp order).

The decisions computed by the above systems are mostly arbitrary. A better way would be to 
minimise aborts, or to follow user preferences, or both. This was the approach of the IceCube system 
[15]. $\mathcal{A}_{\mathrm{IceCube}}$ is an optimization algorithm that minimises the number of dead actions in $\mathcal{A}_{\mathrm{IceCube}}(M)$.
It does so by heuristically comparing all possible sound schedules that can be generated from the 
current site-multilog. The system suggests a number of possible decisions to the user, who states his 
preference. 
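Algorithm 1 ClientActionsConstraints(L)

Require: $L \subseteq A$
$K_i := K_i \cup L$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \to_{\mathcal{M}} \beta$ do
    $\to_i := \to_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \triangleleft_{\mathcal{M}} \beta$ do
    $\triangleleft_i := \triangleleft_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \nparallel_{\mathcal{M}} \beta$ do
    $\nparallel_i := \nparallel_i \cup \{(\alpha, \beta)\}$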



Except for LWW, which is decentralised but deterministic, the above algorithms centralise com- 
mitment at a primary site. 

To decentralise decision, one approach might be to determine a global total order <, using a 
decentralised consensus algorithm such as Paxos [11], and apply $\mathcal{A}_{\mathrm{conservative}}(<)$. As above, this order is arbitrary and $\mathcal{A}_{\mathrm{conservative}}(<)$ tends to kill unnecessarily. Instead, our algorithm allows each site to propose decisions that minimise aborts and follow local client preferences, and to reach
consensus on these proposals in a decentralised manner. This is the subject of the rest of this paper. 

4 Client operation 

We now begin the discussion of our algorithm. We start with a specification of client behaviour. 

4.1 Client Behaviour and client interaction 

An application performs tentative operations by submitting actions and constraints to its local site- 
multilog; they will eventually propagate to all sites. 

We abstract application semantics by postulating that clients have access to a sound multilog containing all the semantic constraints: $\mathcal{M} = (A, \to_{\mathcal{M}}, \triangleleft_{\mathcal{M}}, \nparallel_{\mathcal{M}})$. For an example $\mathcal{M}$, see Section 6.3.

As the client submits actions $L$ to the site-multilog, function ClientActionsConstraints (Algorithm 1) adds constraints with respect to actions that the site already knows.⁵

To illustrate, consider Alice and Bob working together. Alice uses their shared calendar at site 1, and Bob at site 2. Planning a meeting with Bob in Paris, Alice submits two actions: $\alpha$ = "Buy train ticket to Paris next Monday at 10:00" and $\beta$ = "Attend meeting". As $\beta$ depends causally on $\alpha$, $\mathcal{M}$ contains $\alpha \to_{\mathcal{M}} \beta \land \alpha \triangleleft_{\mathcal{M}} \beta$. Alice calls ClientActionsConstraints($\{\alpha\}$) to add action $\alpha$ to site-multilog $M_1$, and, some time later, similarly for $\beta$. At this point, Algorithm 1 adds the constraints $\alpha \to \beta$ and $\alpha \triangleleft \beta$ taken from $\mathcal{M}$.
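Alice's two submissions can be replayed with a small sketch of ClientActionsConstraints. The `Multilog` dataclass, its field names and the action labels `"ticket"` and `"meeting"` are assumptions made for this illustration; the semantic multilog is simply passed in as another `Multilog`.

```python
from dataclasses import dataclass, field

@dataclass
class Multilog:
    actions: set = field(default_factory=lambda: {"INIT"})
    not_after: set = field(default_factory=set)      # pairs (a, b): a NotAfter b
    enables: set = field(default_factory=set)        # pairs (a, b): a Enables b
    non_commuting: set = field(default_factory=set)  # pairs (a, b): a NonCommuting b

def client_actions_constraints(site, semantics, new_actions):
    """Add new_actions to the site-multilog, then copy every semantic
    constraint whose two endpoints are both known at the site."""
    site.actions |= set(new_actions)
    known = site.actions
    for name in ("not_after", "enables", "non_commuting"):
        relation = getattr(site, name)
        relation |= {(a, b) for (a, b) in getattr(semantics, name)
                     if a in known and b in known}

# Alice's scenario: the meeting depends causally on the ticket.
semantics = Multilog(
    actions={"INIT", "ticket", "meeting"},
    not_after={("ticket", "meeting")},
    enables={("ticket", "meeting")},
)
site1 = Multilog()
client_actions_constraints(site1, semantics, {"ticket"})
constraints_after_first_call = set(site1.not_after)   # still empty
client_actions_constraints(site1, semantics, {"meeting"})
```

The constraints only appear at the second call, once both endpoints are known locally.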

⁵ In the pseudo-code, we leave the current time $t$ implicit. A double-slash and sans-serif font indicates a comment, as in // This is a comment.



INRIA 



An Asynchronous, Decentralised Commitment for Optimistic Semantic Replication 






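Algorithm 2 ReceiveAndCompare(M)

Declare: $M = (K, \to, \triangleleft, \nparallel)$, a multilog received from a remote site
$M_i := M_i \cup M$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \to_{\mathcal{M}} \beta$ do
    $\to_i := \to_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \nparallel_{\mathcal{M}} \beta$ do
    $\nparallel_i := \nparallel_i \cup \{(\alpha, \beta)\}$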



4.2 Multilog Propagation 

When a client adds new actions $L$ into a site-multilog, $L$ and the constraints computed by ClientActionsConstraints form a multilog that is sent to remote sites. Upon reception, receivers merge this multilog into their own site-multilog. By this so-called epidemic communication [3], every site eventually receives all actions and constraints submitted at any site.

When site $i$ receives a multilog $M$, it executes function ReceiveAndCompare (Algorithm 2), which first merges what it received into the local site-multilog. Then, if any conflicts exist between previously-known actions and the received ones, it adds the corresponding constraints to the site-multilog.⁶

Let us return to Alice and Bob. Suppose that Bob now adds action $\gamma$, meaning "Cancel the meeting," to $M_2$. Action $\gamma$ is antagonistic with action $\beta$; hence, $\beta \to_{\mathcal{M}} \gamma \land \gamma \to_{\mathcal{M}} \beta$. Some time later, site 2 sends its site-multilog to site 1; when site 1 receives it, it runs Algorithm 2, notices the antagonism, and adds constraint $\beta \to \gamma \land \gamma \to \beta$ to $M_1$. Thereafter, site-schedules at site 1 may include either $\beta$ or $\gamma$, but not both.
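The reception step for this scenario might look as follows. This is a sketch under assumptions: multilogs are plain dictionaries of sets, the helper `new_multilog` and the labels `"ticket"`, `"meeting"` and `"cancel"` are hypothetical, and only NotAfter and NonCommuting constraints are computed on reception, as in Algorithm 2.

```python
def new_multilog(actions=(), not_after=(), enables=(), non_commuting=()):
    """Build a multilog as a dict of sets; INIT is always present."""
    return {"actions": {"INIT", *actions}, "not_after": set(not_after),
            "enables": set(enables), "non_commuting": set(non_commuting)}

def receive_and_compare(site, received, semantics):
    """Merge `received` into `site`, then add the conflict constraints
    that relate any two actions now known at the site."""
    for part in site:                      # component-wise union
        site[part] |= received[part]
    known = site["actions"]
    for part in ("not_after", "non_commuting"):
        site[part] |= {(a, b) for (a, b) in semantics[part]
                       if a in known and b in known}

semantics = new_multilog(
    {"ticket", "meeting", "cancel"},
    not_after={("ticket", "meeting"), ("meeting", "cancel"), ("cancel", "meeting")},
    enables={("ticket", "meeting")},
)
site1 = new_multilog({"ticket", "meeting"},
                     not_after={("ticket", "meeting")},
                     enables={("ticket", "meeting")})
site2 = new_multilog({"cancel"})
receive_and_compare(site1, site2, semantics)
```

After the call, site 1 holds both NotAfter edges between the meeting and its cancellation, so no sound schedule at site 1 can contain both actions.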

5 A decentralised commitment protocol 

Epidemic communication ensures that all site-multilogs eventually receive all information, but site- 
schedules might still differ between sites. 

For instance, let us return to Alice and Bob. Assuming users add no more actions, eventually all site-multilogs become $(\{\mathrm{INIT}, \alpha, \beta, \gamma\}, \{\alpha \to \beta, \beta \to \gamma, \gamma \to \beta\}, \{\alpha \triangleleft \beta\}, \emptyset)$. In this state, actions remain tentative; at time $t$, site 1 might execute $S_1(t) = \mathrm{INIT}; \alpha; \beta$, site 2 $S_2(t) = \mathrm{INIT}; \alpha; \gamma$, and just INIT at $t + 1$. A commitment protocol ensures that $\alpha$, $\beta$ and $\gamma$ eventually stabilise, and that both Alice and Bob learn the same outcome. For instance, the protocol might add $\beta \triangleleft \mathrm{INIT}$ to $M_1$, which guarantees $\beta$, thereby both guaranteeing $\alpha$ and killing $\gamma$. $\alpha$, $\beta$ and $\gamma$ are now decided and stable at site 1. $M_1$ eventually propagates to other sites; and inevitably, all site-schedules eventually start with $\mathrm{INIT}; \alpha; \beta$, and $\gamma$ is dead everywhere.

⁶ ClientActionsConstraints provides constraints between successive actions submitted at the same site. These consist typically of dependence and atomicity constraints. In contrast, ReceiveAndCompare computes constraints between independently-submitted actions.
5.1 Overview

Our key insight is that eventual consistency is equivalent to the property that the site-multilogs of 
all sites share a common well-formed prefix (defined hereafter) of stable actions, which grows to 
include every action eventually. Commitment serves to agree on an extension of this prefix. As 
clients continue to make optimistic progress beyond this prefix, the commitment protocol can run 
asynchronously in the background. 

In our protocol, different sites run instances of $\mathcal{A}$ to make proposals; a proposal being a tentative well-formed prefix of its site-multilog. Sites agree via a decentralised election. This works even if $\mathcal{A}$ is non-deterministic, or if sites use different $\mathcal{A}$ algorithms. We recommend IceCube [15] but any algorithm satisfying the requirements of Section 3 is suitable.

In what follows, $i$ represents the current site, and $j$, $k$ range over $\mathcal{J}$.

We distinguish two roles at each site, proposers and acceptors. Each proposer has a fixed weight, such that $\sum_{k \in \mathcal{J}} weight_k = 1$. In practice, we expect only a small number of sites to have non-zero weights (in the limit, one site might have weight 1; this is a primary site as in Section 3), but the safety of our protocol does not depend on how weights are allocated. To simplify exposition, weights are
distributed ahead of time and do not change; it is relatively straightforward to extend the current 
algorithm, allowing weights to vary between successive elections. 

An acceptor at some site computes the outcome of an election, and inserts the corresponding 
decision constraints into the local site-multilog. 

Each site stores the most recent proposal received from each proposer in array $proposals_i$, of size $n$ (the number of sites). To keep track of proposals, each entry $proposals_i[k]$ carries a logical timestamp, noted $proposals_i[k].ts$. Timestamping ensures the liveness of the election process even though links between nodes are not necessarily FIFO.

Each site performs Algorithm 3. First it initialises the site-multilog and proposals data structures,
then it consists of a number of parallel iterative threads, detailed in the next sections. Within a thread, 
an iteration is atomic. Iterations are separated by arbitrary amounts of time. 

5.2 Epidemic communication 

The first two threads (lines 3-10) exchange multilogs and proposals between sites. Function ReceiveAndCompare (defined in Algorithm 2, Section 4.2) compares actions newly received to already-known ones, in order to compute conflict constraints. In Algorithm 6, a receiver updates its own set of proposals with any more recent ones.
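The timestamp rule that lets a receiver ignore stale proposals arriving over non-FIFO links can be sketched in a few lines. The dictionary representation (site identifier mapped to a `(proposal, ts)` pair) and the name `merge_proposals` are assumptions for this illustration.

```python
def merge_proposals(local, received):
    """Keep, for each proposer k, the proposal with the highest timestamp.

    local, received -- dict: site id -> (proposal, logical timestamp)
    Mutates `local` in place.
    """
    for k, (proposal, ts) in received.items():
        if k not in local or local[k][1] < ts:
            local[k] = (proposal, ts)

# A late message carries an older proposal for site 1 and a newer one for site 2.
local = {1: ("p1-v2", 2), 2: ("p2-v1", 1)}
merge_proposals(local, {1: ("p1-v1", 1), 2: ("p2-v3", 3)})
```

Only the entry for site 2 changes; the older proposal from site 1 is discarded.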

5.3 Client, local state, proposer 

The third thread (lines 12-14) constitutes one half of the client. An application submits tentative
operations to its local site-multilog, which the site-schedule will (hopefully) execute in the fourth 
thread. Constraints relating new actions to previous ones are included at this stage by function 
ClientActionsConstraints (defined in Algorithm 1).

The other half of the client is function ReceiveAndCompare (Algorithm 2), invoked in the second thread (line 9).
Algorithm 3 Algorithm at site i

Declare: $M_i$: local site-multilog
Declare: $proposals_i[n]$: array of proposals, indexed by site; a proposal is a multilog
1: $M_i := (\{\mathrm{INIT}\}, \emptyset, \emptyset, \emptyset)$
2: $proposals_i := [((\{\mathrm{INIT}\}, \emptyset, \emptyset, \emptyset), 0), \ldots, ((\{\mathrm{INIT}\}, \emptyset, \emptyset, \emptyset), 0)]$
3: loop // Epidemic transmission
4:     Choose $j \neq i$
5:     Send copy of $M_i$ and $proposals_i$ to $j$
6: ||
7: loop // Epidemic reception
8:     Receive multilog $M$ and proposals $P$ from some site $j \neq i$
9:     ReceiveAndCompare($M$) // Compute conflict constraints
10:     MergeProposals($P$)
11: ||
12: loop // Client submits
13:     Choose $L \subseteq A$
14:     ClientActionsConstraints($L$) // Submit actions, compute local constraints
15: ||
16: loop // Compute current local state
17:     Choose $S_i \in \Sigma(M_i)$
18:     Execute $S_i$
19: ||
20: loop // Proposer
21:     UpdateProposal // Suppress redundant parts
22:     $proposals_i[i] := \mathcal{A}(M_i \cup proposals_i[i])$ // New proposal, keeping previous
23:     Increment $proposals_i[i].ts$
24: ||
25: loop // Acceptor
26:     Elect



The fourth thread (lines 16-18) computes the current tentative state by executing some sound site-schedule.

The fifth thread (lines 20-23) computes proposals by invoking $\mathcal{A}$. A proposal extends the current site-multilog with proposed decisions. A proposer may not retract a proposal that was already received by some other site. Passing argument $M_i \cup proposals_i[i]$ to $\mathcal{A}$ ensures that these two conditions are satisfied.

However, once a candidate has either won or lost an election, it becomes redundant; UpdateProposal removes it from the proposal (Algorithm 4).

The last thread is described in the next section. 
Algorithm 4 UpdateProposal
Let $P = (K_P, \to_P, \triangleleft_P, \nparallel_P) = proposals_i[i]$
$K_P := K_P \setminus Decided(M_i)$
$\to_P := \to_P \cap (K_P \times K_P)$
$\triangleleft_P := \triangleleft_P \cap (K_P \times K_P)$
$\nparallel_P := \emptyset$
$proposals_i[i] := P$
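A direct transcription of UpdateProposal is sketched below. It assumes that the set of decided actions has been computed elsewhere and is passed in, and that a proposal is a dictionary of sets; both choices, like the name `update_proposal`, are illustrative assumptions.

```python
def update_proposal(proposal, decided):
    """Drop decided actions from a proposal and restrict its
    NotAfter and Enables constraints to the remaining actions."""
    kept = proposal["actions"] - decided

    def inside(pair):
        return pair[0] in kept and pair[1] in kept

    return {
        "actions": kept,
        "not_after": {p for p in proposal["not_after"] if inside(p)},
        "enables": {p for p in proposal["enables"] if inside(p)},
        "non_commuting": set(),            # emptied, as in the algorithm
    }

proposal = {
    "actions": {"INIT", "a", "b", "c"},
    "not_after": {("a", "b"), ("b", "c"), ("c", "b")},
    "enables": {("a", "b"), ("b", "INIT")},
    "non_commuting": set(),
}
trimmed = update_proposal(proposal, decided={"INIT", "a"})
```

The function returns a new dictionary rather than mutating its argument, which keeps the previous proposal available for comparison.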



5.4 Election 

The last thread (lines 25-26) conducts elections. Several elections may be taking place at any point in
time. An acceptor is capable of determining locally the outcome of elections. A proposal can be 
decomposed into a set of eligible candidates. 

5.4.1 Eligible candidates 

A candidate cannot be just any subset of a proposal. Consider, for instance, proposal $P = (\{\mathrm{INIT}, \alpha, \gamma\}, \{\alpha \to \gamma, \gamma \to \alpha, \alpha \to \alpha\}, \{\gamma \triangleleft \mathrm{INIT}\}, \emptyset)$, and some candidate $X$ extracted from $P$. If $X$ could contain $\gamma$ and not $\alpha$, then we might guarantee $\gamma$ without killing $\alpha$, which would be incorrect. According to this intuition, $X$ must be a well-formed prefix of $P$:

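Definition 2 (Well-formed prefix.) Let $M = (K, \to, \triangleleft, \nparallel)$ and $M' = (K', \to', \triangleleft', \nparallel')$ be two multilogs. $M'$ is a well-formed prefix of $M$, noted $M' \overset{wf}{\sqsubseteq} M$, if (i) it is a subset of $M$, (ii) it is stable, (iii) it is left-closed for its actions, and (iv) it is closed for its constraints:

$$
M' \overset{wf}{\sqsubseteq} M \stackrel{\mathrm{def}}{=}
\begin{cases}
M' \subseteq M \\
K' = Stable(M') \\
\forall \alpha, \beta \in A, \beta \in K' \Rightarrow
\begin{cases}
\alpha \to \beta \Rightarrow \alpha \to' \beta \\
\alpha \triangleleft \beta \Rightarrow \alpha \triangleleft' \beta \\
\alpha \nparallel \beta \Rightarrow \alpha \nparallel' \beta
\end{cases} \\
\forall \alpha, \beta \in A, (\alpha \to' \beta \lor \alpha \triangleleft' \beta \lor \alpha \nparallel' \beta) \Rightarrow \alpha, \beta \in K'
\end{cases}
$$

A well-formed prefix is a semantically-meaningful unit of proposal. For instance, if a $\to$ or $\triangleleft$ cycle is present in $M$, every well-formed prefix either includes the whole cycle, or none of its actions.

Unfortunately, because of concurrency and asynchronous communication, it is possible that some sites know of a $\to$ cycle and not others; or more embarrassingly, that sites know only parts of a cycle. Therefore we also require the following property:

Definition 3 (Eligible candidates.) An action is eligible in set $L$ if all its predecessors by client NotAfter, Enables and NonCommuting relations are in $L$. A candidate multilog $M$ is eligible if all actions in $K$ are eligible in $K$: $eligible(M) \stackrel{\mathrm{def}}{=} \forall (\alpha, \beta) \in A \times K, (\alpha \to_{\mathcal{M}} \beta \lor \alpha \nparallel_{\mathcal{M}} \beta \lor \alpha \triangleleft_{\mathcal{M}} \beta) \Rightarrow \alpha \in K$.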
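The eligibility test of Definition 3 reduces to a predecessor check. The sketch below assumes, unrealistically, that the complete set of client constraints is available locally as a set of pairs; in the protocol itself an acceptor can only approximate this. The name `eligible` follows the definition, while the parameter names and action labels are hypothetical.

```python
def eligible(candidate_actions, client_constraints):
    """True if every predecessor (by any client constraint) of an action
    in the candidate also belongs to the candidate.

    client_constraints -- set of pairs (a, b) taken from the union of the
    client NotAfter, Enables and NonCommuting relations
    """
    return all(a in candidate_actions
               for (a, b) in client_constraints
               if b in candidate_actions)

# Ticket, meeting and cancellation from the running example.
constraints = {("ticket", "meeting"), ("meeting", "cancel"), ("cancel", "meeting")}
```

For instance, a candidate holding only the ticket is eligible, whereas a candidate holding the ticket and the meeting is not, because the cancellation precedes the meeting by NotAfter.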
To compute eligibility precisely would require local access to the distributed state, which is
impossible. Therefore acceptors must compute a safe approximation (i.e., false negatives are al- 
lowed) of eligibility. For instance, in the database example, a sufficient condition for transaction 
T to be eligible at site i is that all transactions submitted (at any site) concurrently with T are also 
known at site i. Indeed, all such transactions have gone through either ClientActionsConstraints or 
ReceiveAndCompare; hence according to Table 1, $T$ is eligible.

5.4.2 Computation of votes 

We define a vote as a pair (weight, siteId). The comparison operator for votes breaks ties by comparing site identifiers: $(w, i) > (w', i') \stackrel{\mathrm{def}}{=} w > w' \lor (w = w' \land i > i')$. Therefore, votes add up as follows: $(w, i) + (w', i') \stackrel{\mathrm{def}}{=} (w + w', \max(i, i'))$. Candidates are compatible if their union is sound: $compatible(M, M') \stackrel{\mathrm{def}}{=} \Sigma(M \cup M') \neq \emptyset$. The votes of compatible candidates add up; $tally(X)$ computes the total vote for some candidate $X$:

$$
tally(X) \stackrel{\mathrm{def}}{=} \sum_{k \,:\, X \overset{wf}{\sqsubseteq} proposals_i[k]} (weight_k, k)
$$

An election pits some candidate against comparable candidates from all other sites. Two multilogs are comparable if they contain the same set of actions: $comparable(M, M') \stackrel{\mathrm{def}}{=} K = K'$. The direct opponents of candidate $X$ in some election are comparable candidates that $X$ does not prefix:

$$
opponents(X) \stackrel{\mathrm{def}}{=} \{B \mid \exists k : B \overset{wf}{\sqsubseteq} proposals_i[k] \land comparable(B, X) \land X \not\overset{wf}{\sqsubseteq} B\}
$$

However, we must also count missing votes, i.e., the weights of sites whose proposals do not yet include all actions in $X$. Function $cotally(X)$ adds these up:

$$
cotally(X) \stackrel{\mathrm{def}}{=} \sum_{k \,:\, K_X \not\subseteq K_{proposals_i[k]}} (weight_k, k)
$$

Algorithm 5 depicts the election algorithm. A candidate is a well-formed prefix of some proposal. We ignore already-elected candidates and we only consider eligible ones. A candidate wins its election if its tally is greater than the tally of any direct opponent, plus its cotally. Note that, as proposals are received, cotally tends towards 0, therefore some candidate is eventually elected. We merge the winner into the site-multilog.
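The vote arithmetic and the winning condition can be sketched as follows. This is a deliberately simplified illustration, not the election of Algorithm 5: it assumes proposals carry explicit decisions such as `("kill", "c")`, approximates the well-formed prefix relation by set inclusion, groups opponents by identical decisions on the candidate's actions, and assumes site identifiers are positive integers. All function names are hypothetical.

```python
from fractions import Fraction

ZERO = (Fraction(0), 0)   # neutral vote; relies on site ids being positive

def add_votes(v, w):
    """(w, i) + (w', i') = (w + w', max(i, i'))."""
    return (v[0] + w[0], max(v[1], w[1]))

def total(votes):
    result = ZERO
    for v in votes:
        result = add_votes(result, v)
    return result

def wins(x_actions, x_decisions, proposals, weight):
    """True if candidate X beats every direct opponent plus the cotally.

    proposals -- dict: site id -> (set of actions, set of decisions)
    weight    -- dict: site id -> Fraction
    """
    tally, cotally, opponents = [], [], {}
    for k, (actions, decisions) in proposals.items():
        vote = (weight[k], k)
        if not x_actions <= actions:
            cotally.append(vote)                    # k has not voted on X yet
            continue
        restricted = frozenset(d for d in decisions if d[1] in x_actions)
        if x_decisions <= restricted:
            tally.append(vote)                      # k supports X
        else:
            opponents.setdefault(restricted, []).append(vote)
    best = max((total(vs) for vs in opponents.values()), default=ZERO)
    # Python compares tuples lexicographically: weight first, then site id.
    return total(tally) > add_votes(best, total(cotally))
```

Python's built-in tuple comparison coincides with the vote ordering defined above, which is why no custom comparison operator is needed in this sketch.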

5.5 Example 

We return to our example. Recall that, once Alice and Bob have submitted their actions, and site 1 and site 2 have exchanged site-multilogs, both site-multilogs are equal to $(\{\mathrm{INIT}, \alpha, \beta, \gamma\}, \{\alpha \to \beta, \beta \to \gamma, \gamma \to \beta\}, \{\alpha \triangleleft \beta\}, \emptyset)$. Now Alice (site 1) proposes to guarantee $\alpha$ and $\beta$, and to kill $\gamma$: $proposals_1[1] = M_1 \cup \{\beta \triangleleft \mathrm{INIT}\}$. In the meanwhile, Bob at site 2 proposes to guarantee $\gamma$ and $\alpha$, and to kill $\beta$: $proposals_2[2] = M_2 \cup \{\gamma \triangleleft \mathrm{INIT}, \alpha \triangleleft \mathrm{INIT}\}$. These proposals are incompatible; therefore the commitment protocol will eventually agree on at most one of them.

RR n° 6069

Sutra & Barreto & Shapiro

Algorithm 5 Elect
Let $X$ be a multilog such that:
    $\exists k \in \mathcal{J} : X \overset{wf}{\sqsubseteq} proposals_i[k]$
    $\land\ X \not\subseteq M_i$
    $\land\ eligible(X)$
    $\land\ tally(X) > \max_{B \in opponents(X)} (tally(B)) + cotally(X)$
if such an $X$ exists then
    Choose such an $X$
    $M_i := M_i \cup X$

Algorithm 6 MergeProposals(P)
for all $k$ do
    if $proposals_i[k].ts < P[k].ts$ then
        $proposals_i[k] := P[k]$
        $proposals_i[k].ts := P[k].ts$

Consider now a third site, site 3; assume that the three sites have equal weight $\frac{1}{3}$. Imagine that
site 3 receives site 2's site-multilog and proposal, and sends its own proposal that is identical to
site 1's. Sometime later, site 3 sends its proposal to site 1. At this point, site 1 has received all sites'
proposals. Now site 1 might run an election, considering a candidate $X$ equal to $\mathit{proposals}_1[1]$. $X$
is indeed a well-formed prefix of $\mathit{proposals}_1[1]$; now suppose that $X$ is eligible. As all sites have
voted on $K_X$, $\mathit{tally}(X) = \frac{2}{3}$ is greater than that of $X$'s only opponent
($\mathit{tally}(\mathit{proposals}_1[2]) = \frac{1}{3}$), and $\mathit{cotally}(X) = 0$.

Therefore, site 1 elects $X$ and merges $X$ into $M_1$. Any other site will either elect $X$ (or some
compatible candidate) or become aware of its election by epidemic transmission of $M_1$.
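
The arithmetic of this election can be replayed in a few lines. The sketch below is a simplification and not Algorithm 5 itself: it assumes that each site's vote is reduced to the label of the candidate it supports (`None` if the site has not voted yet), so well-formed prefixes, eligibility and comparability are abstracted away. The names `wins`, `votes` and `weights` are hypothetical.

```python
from fractions import Fraction


def wins(candidate, votes, weights):
    """Return True if `candidate` beats every opponent's tally plus the cotally.

    votes maps a site to the candidate label it supports, or None if that
    site has not voted yet (its weight is then counted in the cotally).
    """
    tally = {}
    cotally = Fraction(0)
    for site, choice in votes.items():
        if choice is None:
            cotally += weights[site]
        else:
            tally[choice] = tally.get(choice, Fraction(0)) + weights[site]
    own = tally.get(candidate, Fraction(0))
    best_opponent = max(
        (t for label, t in tally.items() if label != candidate),
        default=Fraction(0),
    )
    return own > best_opponent + cotally


third = Fraction(1, 3)
weights = {1: third, 2: third, 3: third}
# All three sites have voted: compares 2/3 against 1/3 + 0.
print(wins("X", {1: "X", 2: "Y", 3: "X"}, weights))
# Site 3 has not voted yet: compares 1/3 against 1/3 + 1/3.
print(wins("X", {1: "X", 2: "Y", 3: None}, weights))
```

With a missing vote the cotally keeps the election open, which is why site 1 can only decide once site 3's proposal has arrived.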

6 Discussion 

6.1 Safety proof outline 

Section 2 states our safety property, the conjunction of mergeability and local soundness. Clearly
Algorithm 3 satisfies local soundness; see lines 16-18. We now outline a proof of mergeability.
We say that candidate $X$ is elected in a run $r$ at time $t$ if some acceptor $i$ executes Algorithm 5 in
$r$ at $t$, and elects a candidate $Y$ such that $X \sqsubset^{\mathit{wf}} Y$. Given a run $r$ of Algorithm 3, we note $\mathit{Elected}(r,t)$
the set of candidates elected in $r$ up to time $t$ (inclusive), and $\mathit{Elected}(r)$ the set of candidates elected
during $r$. Observe that, since $\mathcal{M}$ is sound, Algorithm 3 satisfies mergeability in a run $r$ if and only if
the acceptors elect a sound set of candidates during $r$ ($\bigcup_{X \in \mathit{Elected}(r)} X$ is sound).

Suppose, by contradiction, that during run $r$, this set is unsound. As $M$ is sound, by $\mathcal{A}$ candidates
are sound. Consequently there must exist an unsound set of candidates $C \subseteq \mathit{Elected}(r)$. Let us now
consider the following property:

Definition 4 (Minimality.) A multilog $M$ is said to be minimal iff: $\forall M' \subseteq M, M' \sqsubset^{\mathit{wf}} M \Rightarrow M' = M$.

As candidates are eligible, there must exist two candidates X and X' in C such that: (i) X and X' are 
non-compatible, and (ii) X and X' are minimal. 

We define the following notation. Let $i$ (resp. $i'$) be the acceptor that elects $X$ (resp. $X'$) in
$r$. $t$ is the time where $i$ elects $X$ in $r$ (resp. $t'$ for $X'$ on $i'$). For a proposer $k$, $t_k$ (resp. $t'_k$) is the
time at which it sent $\mathit{proposals}_i[k](t)$ to $i$ (resp. $\mathit{proposals}_{i'}[k](t')$ to $i'$). $Q$ (resp. $Q'$) is the set of
proposers that vote for $X$ at $t$ on $i$ (resp. for $X'$ at $t'$ on $i'$); formally
$Q = \{k \mid X \sqsubset^{\mathit{wf}} \mathit{proposals}_i[k](t)\}$ and $Q' = \{k \mid X' \sqsubset^{\mathit{wf}} \mathit{proposals}_{i'}[k](t')\}$.

Hereafter, and without loss of generality, we suppose that: (i) $t < t'$, (ii) $X$ is the first candidate
non-compatible with $X'$ elected in $r$, and (iii) $\mathit{Elected}(r, t' - 1)$ is sound.

Since $i'$ elects $X'$ at $t'$, at that time on site $i'$:

$$\mathit{tally}(X') > \max_{B \in \mathit{opponents}(X')} (\mathit{tally}(B)) + \mathit{cotally}(X') \qquad (1)$$

Equation 1 defines an upper bound for $\mathit{tally}(X)$ on $i$ at $t$, as follows. Consider some $k \in Q$.

If $t_k < t'_k$ then from Algorithm 4 and the fact that $\mathit{Elected}(r, t' - 1)$ is sound, we know that
$X \sqsubset^{\mathit{wf}} \mathit{proposals}_{i'}[k](t')$.

If now $t_k > t'_k$, then as $\mathit{tally}(X')$, $\mathit{opponents}(X')$ and $\mathit{cotally}(X')$ define a partition of $J$, either:

1. $k$ has not yet voted on $K_{X'}$ at $t'$ on $i'$ and its weight is counted in $\mathit{cotally}(X')$.

2. Or, if its vote already includes $K_{X'}$, it is counted in $\mathit{opponents}(X')$, as $X$ is the first candidate
non-compatible with $X'$ elected in $r$, $X \sqsubset^{\mathit{wf}} \mathit{proposals}_i[k](t)$, and $\neg\,\mathit{compatible}(X, X')$.

From these reasonings (if $t_k < t'_k$ and if $t'_k < t_k$), and Equation 1, we derive:

$$\mathit{tally}_{i'}(X')(t') > \mathit{tally}_i(X)(t) \qquad (2)$$

where $\mathit{tally}_k(Z)(\tau)$ means the value of $\mathit{tally}(Z)$ computed at time $\tau$ on site $k$.
Now consider some $k \in Q'$.

If $t_k > t'_k$ then, $X$ being the first candidate non-compatible with $X'$ elected in $r$, from Algorithm 4
we have $X' \sqsubset^{\mathit{wf}} \mathit{proposals}_i[k](t)$.

If $t_k < t'_k$, now either

1. $X' \sqsubset^{\mathit{wf}} \mathit{proposals}_i[k](t)$

2. or $k$ has not yet voted on $K_X$ on $i$ at $t$.

The reasoning here is similar to $k \in Q$: we use the minimality of $X$ and $X'$, the fact that they are
non-compatible, and that $X$ is the first candidate non-compatible with $X'$ elected in $r$.
From the above, it follows that:

$$\mathit{tally}_{i'}(X')(t') \le \mathit{tally}_i(X')(t) + \mathit{cotally}_i(X)(t) \qquad (3)$$

Now, combining equations 2 and 3, we conclude that, at site $i$ at time $t$:

$$\mathit{tally}(X) < \max_{B \in \mathit{opponents}(X)} (\mathit{tally}(B)) + \mathit{cotally}(X) \qquad (4)$$

X cannot be elected on i at t. Contradiction. 

6.2 Time complexity to run an election 

Let $M$ be a site-multilog, and let $m$ be the number of actions in $M$. We first extract from proposals
the set of candidates as follows:

1. For every proposal $P \in \mathit{proposals}$, for every action $\alpha \in P$, we compute the list of predecessors
by $\rightarrow$ and $\lhd$ of $\alpha$ in $P$.

2. Let $l$ be such a list; we then compute $d = l \cap \mathit{Dead}(P)$ for every $P$.

3. Then for any couple $(\alpha, \beta) \in l$ such that $\alpha \nparallel \beta \in \nparallel_P$, we save the serialization decision: either
$\alpha \rightarrow \beta \in \rightarrow_P$ or $\beta \rightarrow \alpha \in \rightarrow_P$. It forms a set of couples $s$, containing at most $\frac{1}{2}(m^2 - m)$
elements.⁷

A candidate is any tuple $X = (l, d, s)$. According to items 1, 2 and 3, the time complexity to extract all
the candidates in proposals is at most $O(nm)$, since all operations can be performed simultaneously.

We compute $\mathit{cotally}(X)$ by comparing $l$ to $P.K$ for any $P \in \mathit{proposals}$: $O(mn)$ operations. Finally
we divide the remaining proposals into $\mathit{tally}(X)$ and $\mathit{opponents}(X)$ by comparing $\mathit{Dead}(P)$ and $\nparallel_P$ to
$d$ and $s$: $O(n(m + s))$ operations.

Since $\nparallel$ is symmetric, there can exist at most $O(n(m - \sqrt{2s}))$ candidates. Thus we have to consider the
maximum of the function $(m + s)(m - \sqrt{2s})$. It follows that $s = \frac{2}{9} m^2$, and that the time complexity
of the whole election process is $O(m^3 n^2)$.
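
The maximisation step can be checked by hand. The derivation below is a sketch added for illustration; it assumes that $s$ can be treated as a continuous variable on $[0, \frac{1}{2}m^2]$ and keeps only the leading terms for large $m$.

$$
\begin{aligned}
f(s) &= (m + s)(m - \sqrt{2s}), \qquad f'(s) = (m - \sqrt{2s}) - \frac{m + s}{\sqrt{2s}} \\
f'(s) = 0 &\iff m\sqrt{2s} = m + 3s \\
\text{with } u = \sqrt{2s}: \quad & 3u^2 - 2mu + 2m = 0 \;\Rightarrow\; u = \frac{m + \sqrt{m^2 - 6m}}{3} \approx \frac{2m}{3} \\
s = \frac{u^2}{2} &\approx \frac{2}{9} m^2, \qquad f(s) \approx \frac{2m^2}{9} \cdot \frac{m}{3} = \frac{2}{27} m^3 = O(m^3)
\end{aligned}
$$

Multiplying the $O(n)$ factor of the candidate count by the $O(n)$ factor of the per-candidate cost then gives the $O(m^3 n^2)$ bound.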

6.3 Message cost 

Interestingly, the message cost of our protocol varies with application semantics, along two dimen- 
sions. 

First, the degree of semantic complexity, i.e., the complexity of the client constraint graph $\mathcal{M}$,
influences the number of votes required. To illustrate, consider an application where all actions are
mutually independent, i.e., $\mathcal{M}$ contains no constraints. Then, all actions commute with one another,
and no action ever needs to be killed. Every candidate is trivially eligible, and trivially compatible
with all other candidates.

7 It equals the maximum number of edges in a strongly connected graph of size $m$.

Table 1: $\mathcal{M}_{\text{SER-DB-after}}$: Constraints for a serialisable database that transmits after-values
    Columns: $T \prec T'$ | $T \parallel T'$ | $T' \prec T$
    Rows: $RS(T) \cap WS(T') \neq \emptyset$; $WS(T) \cap WS(T') \neq \emptyset$
    Entries: $T \rightarrow T'$; $T' \rightarrow T \wedge T' \lhd T$; $T' \rightarrow T$

Second, call degree of optimism $d$ the size of a batch, i.e., the number of actions that a site may
execute tentatively before requiring commitment. This measures both the extent to which replicas relax
consistency and the extent to which clients propose concurrent commutative actions to the same replica.
It takes a chain of $\frac{n}{2}$ messages to construct a majority. A candidate may contain up to $d$ actions.
Therefore, the amortised message cost to commit an action is $\frac{n}{2} \times \frac{1}{d}$.

A more detailed evaluation of message cost is left for future work. 



6.4 Implementation considerations 

Our pseudo-code was written for clarity, not efficiency. Many optimisations are possible. For instance,
a site $i$ does not need to send the whole $\mathit{proposals}_i[i]$. When sending to $j$, it suffices to send
the difference $\mathit{proposals}_i[i] \setminus \mathit{proposals}_j[j]$.
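
As an illustration of this optimisation, the sketch below computes a component-wise multilog difference and checks that the receiver can rebuild the full proposal from what it already holds plus the delta. It assumes a multilog is represented as a dictionary of sets; the names `multilog_diff`, `multilog_union` and the field names are hypothetical, not part of the paper's pseudo-code.

```python
def multilog_diff(mine, known_by_peer):
    """Component-wise difference: what `mine` has that the peer lacks."""
    return {part: mine[part] - known_by_peer.get(part, set()) for part in mine}


def multilog_union(a, b):
    """Component-wise union of two multilogs."""
    parts = set(a) | set(b)
    return {part: a.get(part, set()) | b.get(part, set()) for part in parts}


proposal_i = {
    "K": {"INIT", "alpha", "beta"},
    "not_after": {("alpha", "beta")},
    "enables": {("alpha", "beta"), ("beta", "INIT")},
    "non_commuting": set(),
}
held_by_j = {
    "K": {"INIT", "alpha", "beta"},
    "not_after": {("alpha", "beta")},
    "enables": {("alpha", "beta")},
    "non_commuting": set(),
}

delta = multilog_diff(proposal_i, held_by_j)   # only the new decision travels
rebuilt = multilog_union(held_by_j, delta)     # the receiver merges the delta
print(delta["enables"], rebuilt == proposal_i)
```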

Conceptually, a multilog grows without bound. However, a stable action, and all its constraints, 
can safely be deleted. 

Conceptually, our algorithm executes all actions everywhere. A practical implementation only
needs to achieve an equivalent state; in particular actions that do not have side-effects do not have to
be replayed. For instance, in a database application, read operations do not need to be replayed.⁸



6.5 Example application 

We illustrate the application of our algorithm to a replicated database. The semantic constraints
between two transactions depend on several factors: (i) whether the transactions are related by
happens-before or are concurrent; (ii) whether their read- and write-sets intersect or not; (iii) what
consistency criterion is being enforced (for instance, constraints differ between serializability and
snapshot isolation [2]); (iv) how, after executing a transaction on some initial site, the system
replicates its effects at a remote site: by replaying the transaction, or by applying the after-values
computed at the initial site.

8 Formally, we need to generalise the equivalence relation between schedules, which currently is based only on [18]. The
definition of consistency now becomes that every pair of sites eventually converges to schedules that are equivalent according
to the new relation.

Table 1 exhibits semantic constraints between transactions, where (a) the system replicates a
transaction by writing its after-values, and (b) transactions are strictly serialisable.⁹ Supporting a
different semantics, e.g., (a') replaying actions, or (b') SI, requires only some small changes to the
table.

7 Related work 

In previous OR systems, commitment was often either centralised at a primary site [13, 20] or
oblivious of semantics [7, 17]. It is very difficult to combine decentralisation with semantics.

Our election algorithm is inspired by Keleher's Deno system [8], a pessimistic system, which
performs a discrete sequence of elections. Keleher proposes plurality voting to ensure progress
when none of multiple competing proposals gains a majority. The VVWV protocol of Barreto and
Ferreira generalizes Deno's voting procedure, enabling continuous voting [1].

The only semantics supported by Deno or VVWV is to enforce Lamport's happens-before relation [10];
all actions are assumed to be mutually non-commuting. Happens-before captures potential
causality; however an event may happen-before another even if they are not truly dependent. This
paper further generalizes VVWV by considering semantic constraints.

Holliday et al. depict a family of epidemic algorithms to ensure serializability in replicated database
systems [5]. The three algorithms consider that concurrent conflicting transactions are antagonistic.
Two of them abort concurrent conflicting transactions, and the last one (quorum-based) can
only commit one transaction among a set of concurrent conflicting ones. Our algorithm considers
that concurrent conflicting transactions are not necessarily antagonistic; it tries to optimize the number
of committed transactions, computing best-effort proposals, and electing them with plurality.

ESDS [4] is a decentralised replication protocol that supports some semantics. It allows users to
create an arbitrary causal dependence graph between actions. ESDS eventually computes a global 
total order among actions, but also includes an optimisation for the case where some action pairs 
commute. ESDS does not consider atomicity or antagonism relations, nor does it consider dead 
actions. 

Bayou [20] supports arbitrary application semantics. User-supplied code controls whether an 
action is committed or aborted. However the system imposes an arbitrary total execution order. 
Bayou centralises decision at a single primary replica. 

IceCube [9] introduced the idea of reifying semantics with constraints. The IceCube algorithm
computes optimal proposals, minimizing the number of dead actions. Like Bayou, commitment in 
IceCube is centralised at a primary. Compared to this article, IceCube supports a richer constraint 
vocabulary, which is useful for applications, but harder to reason about formally. 

The Paxos distributed protocol [11] computes a total order. Such a total order may be used to
implement state-machine replication [10], whereby all sites execute exactly the same schedule. Such
a total order over all actions is necessary only if all actions are mutually non-commuting. In Section 3
we showed how to combine semantic constraints with a total order, but this approach is clearly
sub-optimal. However, Paxos remains live even if $f < \frac{n}{2}$ sites crash forever, whereas the other
systems described here (including ours) block if a site crashes forever. We assume that a site stores
its multilogs and its proposals in persistent memory, and that after a crash it recovers with its identity and
persistent store intact. This is a fairly reasonable assumption in a well-managed cooperative system.
(For instance, each site might actually be implemented as a cluster on a LAN, with redundant storage,
and strong consistency internally.)

9 $T \prec T'$ denotes $T$ happens-before $T'$ [10]. $T \parallel T'$ denotes concurrency, i.e., neither $T \prec T'$, nor $T' \prec T$. $RS(T)$ and
$WS(T)$ denote $T$'s read set and write set respectively.

Generalized Paxos [12] and Generic Broadcast [14] take commutativity relations into account
and compute a partial order. They do not consider any other semantic relations. Both Generalized
Paxos [12] and our algorithm make progress when a majority is not reached, although through different
means. Generalized Paxos starts a new election instance, whereas our algorithm waits for a
plurality decision.

8 Conclusion and future work 

The focus of our study is cooperative applications with rich semantics. Previous approaches to 
replication did not support a sufficiently rich repertoire of semantics, or relied on a centralized point 
of commitment. They often impose a total order, which is stronger than necessary. 

In contrast, we propose a decentralized commitment protocol for semantically-rich systems. Our 
approach is to reify semantic relations as constraints, which restrict the scheduling behavior of the 
system. According to our formal definition of consistency, the system has an obligation to resolve 
conflicts, and to eventually execute equivalent stable schedules at all sites. 

Our protocol is safe in the absence of Byzantine faults, and live in the absence of crashes. It 
uses voting to avoid any centralization bottleneck, and to ensure that the result is similar to local 
proposals. It uses plurality voting to make progress even when an election does not reach a majority. 

There is an interesting trade-off in the proposal/voting procedure. The system might decide fre- 
quently, in small increments, so that users quickly know whether their tentative actions are accepted 
or rejected. However this might be non-optimal as it may cut off interesting future behaviors. Or 
it may base its decisions on a large batch of tentative actions, deciding less frequently. This im- 
poses more uncertainty on users, but decisions may be closer to the optimum. We plan to study this 
trade-off in our future work. 

Another future direction is partial replication. In such a system, a site receives only the actions 
relative to the objects it replicates (and their constraints). A site votes only on the actions it knows. 
Because constraints might relate actions known only by distinct sites, these sites must agree together; 
however we expect that global agreement is rarely necessary. By exploiting knowledge of semantic 
constraints, we hope to limit the scope of a commitment protocol to small-scale agreements, instead 
of a global consensus.

1 Introduction

In a large-scale cooperative system, access to shared data is a performance and availability bottleneck.
One solution is optimistic replication (OR), where a process may read or update its local
replica without synchronising with remote sites [17]. OR decouples data access from network access.

In OR, each site makes progress independently, even while others are slow, currently discon- 
nected, or currently working in isolated mode. OR is well suited to peer-to-peer systems and to 
devices with occasional connectivity. 

Some limited knowledge of semantics provides a lot of extra power and flexibility. Therefore, 
we model the system as a graph, called a multilog, where each vertex represents an action (i.e., an 
operation proposed by some client), and an edge is a semantic relation between vertices, called a 
constraint. Our constraints include conflict, ordered execution, causal dependence, and atomicity. 
Each site has its own multilog, which contains actions submitted by the local client, and their con- 
straints, as well as those received from other sites. The current state is some execution schedule 
that contains actions from the site's multilog, arranged to conform with its constraints. For instance, 
when actions are antagonistic, at least one must abort; an action that depends on an aborted action 
must abort too; non-commutative actions should be scheduled in the same order everywhere, etc. 
The site may choose any conforming schedule, e.g., one that minimises aborts, or one that reflects 
user preferences. 

For consistency, sites should agree on a common, stable and correct schedule. We call this
agreement commitment. Some cooperative OR systems never commit, such as Roam [16] or
Draw-Together [6]. Previous work on commitment for semantic OR such as Bayou [20] or IceCube [15]
centralises the agreement at a central site. Other work decentralises commitment (e.g., Paxos
consensus [11]) but ignores semantics. It is difficult to reconcile semantics and decentralisation. One
possible approach would use Paxos to compute a total order, and abort any actions for which this
order would violate a constraint. However this approach aborts actions unnecessarily. Furthermore,
the arbitrary total order may be very different from what users expect.

A better approach is to order only non-commuting pairs of actions, to abort only when actions
are antagonistic, to minimise dependent aborts, and to remain close to user expectations. We propose
an efficient, decentralised protocol that uses semantic information for this purpose. Participating
sites make and exchange proposals asynchronously; our algorithm decomposes each one
into semantically-meaningful candidates; it runs elections between comparable candidates. A candidate
that collects a majority or a plurality wins its election. Voting ensures that the common
schedule is similar to the tentative schedules, minimising user surprise. Our protocol orders only
non-commuting actions and minimises unnecessary aborts.

This paper makes several contributions: 

• Our algorithm combines a number of known techniques in a novel manner. 

• We identify the concept of a semantically-meaningful unit for election (which we call a candidate).

• We propose an efficient commitment protocol that is both decentralised and semantic-oriented,
and that has weak communication requirements.

• We show how to minimise user surprise, the committed schedule being similar to local tenta- 
tive schedules. 

• We prove that the protocol is safe even in the presence of non-byzantine faults. The protocol 
is live as long as a sufficient number of votes are received. 

The outline of this paper follows. Section 2 introduces our system model and our vocabulary.
Section 3 discusses an abstraction of classical OR approaches that is later re-used in our algorithm.
Section 4 specifies client behaviour. Our commitment protocol is specified in Section 5. Section 6
provides a proof outline and addresses message cost. We compare with related work in Section 7. In
conclusion, Section 8 discusses our results and future work.



2 System model 

Following the ACF model [18], an OR system is an asynchronous distributed system of $n$ sites
$i, j, \ldots \in J$. A site that crashes eventually recovers with its identity and persistent memory intact
(but may miss some messages in the interval). Clients propose actions (deterministic operations)
noted $\alpha, \beta, \ldots \in A$. An action might request, for instance, "Debit 100 euros from bank account
number 12345."

A multilog is a quadruple $M = (K, \rightarrow, \lhd, \nparallel)$, representing a graph where the vertices $K$ are actions,
and $\rightarrow$, $\lhd$ and $\nparallel$ (pronounced NotAfter, Enables and NonCommuting respectively) are three
sets of edges called constraints. We will explain their semantics shortly.¹

We identify a state with a schedule $S$, a sequence of distinct actions ordered by $<_S$ executed
from the common initial state INIT. The following safety condition defines the semantics of NotAfter
and Enables in relation to schedules. We define $\Sigma(M)$, the set of schedules $S$ that are sound with
respect to multilog $M$, as follows:



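$$
S \in \Sigma(M) \stackrel{\mathrm{def}}{=} \forall \alpha, \beta \in A,
\begin{cases}
\mathrm{INIT} \in S \\
\alpha \in S \Rightarrow \alpha \in K \\
\alpha \in S \wedge \alpha \neq \mathrm{INIT} \Rightarrow \mathrm{INIT} <_S \alpha \\
(\alpha \rightarrow \beta) \wedge \alpha, \beta \in S \Rightarrow \alpha <_S \beta \\
(\alpha \lhd \beta) \Rightarrow (\beta \in S \Rightarrow \alpha \in S)
\end{cases}
$$

Constraints represent scheduling relations between actions: NotAfter is a (non-transitive) ordering
relation and Enables is right-to-left (non-transitive) implication.²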
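
The safety condition can be read as a membership test for $\Sigma(M)$. The sketch below is an illustration under assumed representations: a schedule is a Python list, NotAfter and Enables are sets of pairs, and the pair `(a, b)` in `enables` is read right-to-left (if `b` is scheduled then `a` must be too). The name `is_sound_schedule` is hypothetical.

```python
INIT = "INIT"


def is_sound_schedule(schedule, actions, not_after, enables):
    """Check whether `schedule` satisfies the safety condition for the multilog."""
    position = {a: i for i, a in enumerate(schedule)}
    if len(position) != len(schedule):      # actions must be distinct
        return False
    if not schedule or schedule[0] != INIT:  # INIT is present and comes first
        return False
    if any(a not in actions for a in schedule):
        return False
    for a, b in not_after:                   # a -> b: a must come before b
        if a in position and b in position and not position[a] < position[b]:
            return False
    for a, b in enables:                     # a <| b: b scheduled implies a scheduled
        if b in position and a not in position:
            return False
    return True


actions = {INIT, "alpha", "beta"}
not_after = {("alpha", "beta")}
enables = {("alpha", "beta")}
print(is_sound_schedule([INIT, "alpha", "beta"], actions, not_after, enables))
print(is_sound_schedule([INIT, "beta"], actions, not_after, enables))
```

Note that a self-loop such as `("alpha", "alpha")` in `not_after` makes every schedule containing `alpha` fail the ordering test, which matches the way the paper kills an action.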



1 Multilog union, inclusion, difference, etc., are defined as component-wise union, inclusion, difference, etc., respectively.
For instance if $M = (K, \rightarrow, \lhd, \nparallel)$ and $M' = (K', \rightarrow', \lhd', \nparallel')$, their union is
$M \cup M' = (K \cup K', \rightarrow \cup \rightarrow', \lhd \cup \lhd', \nparallel \cup \nparallel')$.

2 A constraint is a relation in $A \times A$. By abuse of notation, for some relation $\mathcal{R}$, we write equivalently $(\alpha \mathrel{\mathcal{R}} \beta) \in M$ or
$\alpha \mathrel{\mathcal{R}} \beta$ or $(\alpha, \beta) \in \mathcal{R}$. $\nparallel$ is symmetric and $\lhd$ is reflexive. They do not have any further special properties; in particular, $\rightarrow$ and
$\lhd$ are not transitive, are not orders, and may be cyclic.

Constraints represent semantic relations between actions. For instance, consider a database system
(more precisely, a serialisable database that transmits transactions by value, such as DBSM [13]).
Assume shared variables $x, y, z$ are initially zero. Two concurrent transactions $T_1 = r(x)0; w(z)1$
and $T_2 = w(x)2$ are related by $T_1 \rightarrow T_2$, since $T_1$ read a value that precedes $T_2$'s write.³ $T_1$ and
$T_3 = r(z)0; w(x)3$ are antagonistic, i.e., one or the other (or both) must abort, as each is NotAfter the
other. In the execution $T_1; T_4$ where $T_4 = r(z)1$, the latter transaction depends causally on the former,
i.e., they may run only in that order, and $T_4$ aborts if $T_1$ aborts; we write $T_1 \rightarrow T_4 \wedge T_1 \lhd T_4$. As
another example, Section 6.5 discusses how to encode the semantics of database transactions with
constraints.

Non-commutativity imposes a liveness obligation: the system must put a NotAfter between
non-commuting actions, or abort one of them. (Therefore, non-commutativity does not appear in the
above safety condition.) The system also has the obligation to resolve antagonisms by aborting
actions.

For instance, transactions $T_1$ and $T_5 = r(y)0$ commute if $x$, $y$ and $z$ are independent. In a database
system that commits operations (as opposed to committing values), transactions $T_6$ = "Credit 66 euros
to Account 12345" and $T_7$ = "Credit 77 euros to Account 12345" commute since addition is a commutative
operation, but $T_6$ and $T_8$ = "Debit 88 euros from Account 12345" do not, if bank accounts
are not allowed to become negative. We write $T_6 \nparallel T_8$.

Order, antagonism and non-commutativity are collectively called conflicts.⁴

Clients submit actions to their local site; sites exchange actions and constraints asynchronously.
The current knowledge of site $i$ at time $t$ is the distinguished site-multilog $M_i(t)$. Initially,
$M_i(0) = (\{\mathrm{INIT}\}, \emptyset, \emptyset, \emptyset)$, and it grows over time, as we will explain later. A site's current state is the
site-schedule $S_i(t)$, which is some (arbitrary) schedule in $\Sigma(M_i(t))$.

An action executes tentatively only, because of conflicts and related issues. However, an action
might have sufficient constraints that its execution is stable. We distinguish the following interesting
subsets of actions relative to $M$.

• Guaranteed actions appear in every schedule of $\Sigma(M)$. Formally, $\mathit{Guar}(M)$ is the smallest
subset of $K$ satisfying: $\mathrm{INIT} \in \mathit{Guar}(M) \wedge ((\alpha \in \mathit{Guar}(M) \wedge \beta \lhd \alpha) \Rightarrow \beta \in \mathit{Guar}(M))$.

• Dead actions never appear in a schedule of $\Sigma(M)$. $\mathit{Dead}(M)$ is the smallest subset of $A$ satisfying:
$((\alpha_1, \ldots, \alpha_{m \ge 0} \in \mathit{Guar}(M)) \wedge (\beta \rightarrow \alpha_1 \rightarrow \ldots \rightarrow \alpha_m \rightarrow \beta) \Rightarrow \beta \in \mathit{Dead}(M)) \wedge ((\alpha \in \mathit{Dead}(M) \wedge \alpha \lhd \beta) \Rightarrow \beta \in \mathit{Dead}(M))$.

• Serialised actions are either dead or ordered with respect to all non-commuting constraints.
$\mathit{Serialised}(M) \stackrel{\mathrm{def}}{=} \{\alpha \in K \mid \forall \beta \in A, \alpha \nparallel \beta \Rightarrow \alpha \rightarrow \beta \vee \beta \rightarrow \alpha \vee \beta \in \mathit{Dead}(M) \vee \alpha \in \mathit{Dead}(M)\}$.

• Decided actions are either dead, or both guaranteed and serialised.
$\mathit{Decided}(M) \stackrel{\mathrm{def}}{=} \mathit{Dead}(M) \cup (\mathit{Guar}(M) \cap \mathit{Serialised}(M))$.

• Stable (i.e., durable) actions are decided, and all actions that precede them by NotAfter or Enables
are themselves stable:
$\mathit{Stable}(M) \stackrel{\mathrm{def}}{=} \mathit{Dead}(M) \cup \{\alpha \in \mathit{Guar}(M) \cap \mathit{Serialised}(M) \mid \forall \beta \in A, (\beta \rightarrow \alpha \vee \beta \lhd \alpha) \Rightarrow \beta \in \mathit{Stable}(M)\}$.

3 $r(x)n$ stands for a read of $x$ returning value $n$, and $w(x)n$ writes the value $n$ into $x$.

4 Some authors suggest to remove conflicts by transforming the actions [19]. We assume that, if such transformations are
possible, they have already been applied.

Figure 1: $\mathcal{A}_{\text{conservative}}(<)$: Applying semantic constraints to a given total order

To decide an action $\alpha$ relative to a multilog $M$ means to add constraints to $M$, such that
$\alpha \in \mathit{Decided}(M)$. In particular, to guarantee $\alpha$, we add $\alpha \lhd \mathrm{INIT}$ to the multilog, and to kill $\alpha$, we
add $\alpha \rightarrow \alpha$; to serialise non-commuting actions $\alpha$ and $\beta$, we add either $\alpha \rightarrow \beta$, $\beta \rightarrow \alpha$, $\alpha \rightarrow \alpha$, or
$\beta \rightarrow \beta$.

Multilog $M$ is said to be sound iff $\Sigma(M) \neq \emptyset$, or equivalently, iff $\mathit{Dead}(M) \cap \mathit{Guar}(M) = \emptyset$. An
unsound multilog is definitely broken, i.e., no possible schedule can satisfy all the constraints, not
even the empty schedule.
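
The sets Guar and Dead are least fixpoints, so they can be computed by simple iteration. The sketch below is an illustration, not code from the paper: it assumes that constraints are sets of pairs, that `(a, b)` in `enables` stands for a Enables b, and that `(a, b)` in `not_after` stands for a NotAfter b. The function names are hypothetical.

```python
INIT = "INIT"


def guar(enables):
    """INIT is guaranteed; b is guaranteed whenever b <| a for a guaranteed a."""
    guaranteed = {INIT}
    changed = True
    while changed:
        changed = False
        for b, a in enables:
            if a in guaranteed and b not in guaranteed:
                guaranteed.add(b)
                changed = True
    return guaranteed


def dead(actions, not_after, enables):
    guaranteed = guar(enables)

    def on_cycle(beta):
        # Look for beta -> a1 -> ... -> am -> beta with every ai guaranteed (m >= 0).
        frontier, seen = [beta], set()
        while frontier:
            x = frontier.pop()
            for u, v in not_after:
                if u != x:
                    continue
                if v == beta:
                    return True
                if v in guaranteed and v not in seen:
                    seen.add(v)
                    frontier.append(v)
        return False

    killed = {b for b in actions if on_cycle(b)}
    changed = True
    while changed:  # a dead and a <| b implies b dead
        changed = False
        for a, b in enables:
            if a in killed and b not in killed:
                killed.add(b)
                changed = True
    return killed


def is_sound(actions, not_after, enables):
    return not (dead(actions, not_after, enables) & guar(enables))


# A multilog with an antagonism between "a" and "c", where "b" is guaranteed.
actions = {INIT, "a", "b", "c"}
not_after = {("a", "b"), ("a", "c"), ("c", "a")}
enables = {("a", "b"), ("b", INIT)}
print(sorted(guar(enables)), sorted(dead(actions, not_after, enables)))
```

Guaranteeing `b` also guarantees `a` through the Enables edge, and the cycle through the guaranteed `a` then makes `c` dead; additionally guaranteeing `c` would make the multilog unsound.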

Referring to the standard database terminology, a committed action is one that is both stable and 
guaranteed, and aborted is the same as dead. 

The standard correctness condition in OR systems is Eventual Consistency: if clients stop sub- 
mitting, eventually all sites reach the same state. We extend this definition by not requiring that 
clients stop, by requiring that all states be correct, and by demanding decision. 

Definition 1 (Eventual Consistency.) An OR system is eventually consistent iff it satisfies all the
following conditions:

• Local soundness (safety): Every site-schedule is sound: $\forall i, t : S_i(t) \in \Sigma(M_i(t))$

• Mergeability (safety): The union of all the site-multilogs over time is sound:
$\Sigma(\bigcup_{i,t} M_i(t)) \neq \emptyset$

• Eventual propagation (liveness): $\forall i, j \in J, \forall t, \exists t' : M_i(t) \subseteq M_j(t')$

• Eventual decision (liveness): Every submitted action is eventually decided:
$\forall \alpha \in A, \forall i \in J, \forall t, \exists t' : K_i(t) \subseteq \mathit{Decided}(M_i(t'))$

We assume some form of epidemic communication to fulfill Eventual Propagation. A commitment
algorithm aims to fulfill the obligations of Eventual Decision. Of course, it must also satisfy
the safety requirements.

3 Classical OR commitment algorithms

Our proposal builds upon existing commitment algorithms for OR systems. Generally, these either
are centralised or do not take constraints into account. We note $\mathcal{A}(M)$ some algorithm that offers
decisions based on multilog $M$; with no loss of generality, we focus on the outcome of $\mathcal{A}$ at a single
site. Assuming $M$ is sound, and noting the result $M' = \mathcal{A}(M)$, $\mathcal{A}$ must satisfy these requirements:

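• $\mathcal{A}$ extends its input: $M \subseteq M'$.

• $\mathcal{A}$ may not add actions: $K' = K$.

• $\mathcal{A}$ may add constraints, which are restricted to decisions:

$$
\begin{aligned}
\alpha \rightarrow' \beta &\Rightarrow (\alpha \rightarrow \beta) \vee (\alpha \nparallel \beta) \vee (\beta = \alpha) \\
\alpha \lhd' \beta &\Rightarrow (\alpha \lhd \beta) \vee (\beta = \mathrm{INIT}) \\
\nparallel' &= \nparallel
\end{aligned}
$$

• $M'$ is sound.

• $M'$ is stable: $\mathit{Stable}(M') = K$.

$\mathcal{A}$ could be any algorithm satisfying the requirements.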

One possible algorithm, $\mathcal{A}_{\text{conservative}}(<)$, first orders actions, then kills actions for which the
order is unsafe. It proceeds as follows (see Figure 1). Let $<$ be a total order of actions and $M$ a
sound multilog. The algorithm decides one action at a time, varying over all actions, left to right;
call the current action $\beta$. Consider actions $\alpha$ and $\gamma$ such that $\alpha < \beta < \gamma$: $\alpha$ has already been decided,
and $\gamma$ has not. If $\beta \nparallel \gamma$, then serialise them in schedule order. If $\beta \rightarrow \alpha$, and $\alpha$ is guaranteed, kill $\beta$,
because the schedule and the constraint are incompatible. If $\gamma \lhd \beta$, conservatively kill $\beta$, because it
is not known whether $\gamma$ can be guaranteed. By definition, if $\alpha \lhd \beta$ and $\alpha$ is dead, then $\beta$ is dead. If $\beta$
is not dead by any of the above rules, then decide $\beta$ guaranteed (by adding $\beta \lhd \mathrm{INIT}$ to the multilog).
The resulting $\Sigma(\mathcal{A}_{\text{conservative}}(<)(M))$ contains a unique schedule.

It should be clear that this approach is safe but tends to kill actions unnecessarily. 
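
The rules above translate fairly directly into code. The following sketch is one possible reading of them, under stated assumptions: `order` lists the actions other than INIT in the given total order, constraints are sets of pairs (`(a, b)` in `not_after` for a NotAfter b, `(a, b)` in `enables` for a Enables b), and an action that already carries a self-loop is treated as dead, following the definition of Dead. The name `conservative` is hypothetical.

```python
INIT = "INIT"


def conservative(order, not_after, enables, non_commuting):
    """Decide every action of `order`, left to right; return the unique
    schedule plus the extended NotAfter and Enables relations."""
    not_after, enables = set(not_after), set(enables)
    guaranteed, dead = {INIT}, set()
    for index, beta in enumerate(order):
        later = order[index + 1:]
        # Serialise beta with later non-commuting actions in schedule order.
        for gamma in later:
            if (beta, gamma) in non_commuting or (gamma, beta) in non_commuting:
                not_after.add((beta, gamma))
        kill = (
            (beta, beta) in not_after                             # already killed
            or any((beta, a) in not_after for a in guaranteed)    # beta -> earlier guaranteed a
            or any((g, beta) in enables for g in later)           # gamma <| beta, gamma undecided
            or any((a, beta) in enables for a in dead)            # alpha <| beta, alpha dead
        )
        if kill:
            dead.add(beta)
            not_after.add((beta, beta))
        else:
            guaranteed.add(beta)
            enables.add((beta, INIT))
    schedule = [INIT] + [a for a in order if a in guaranteed]
    return schedule, not_after, enables


not_after = {("a", "b"), ("a", "c"), ("c", "a")}
enables = {("a", "b")}
print(conservative(["a", "b", "c"], not_after, enables, set())[0])
print(conservative(["c", "a", "b"], not_after, enables, set())[0])
```

The two calls differ only in the total order, yet the second one kills two actions instead of one, which illustrates how an arbitrary order can cause unnecessary aborts.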

The Bayou system [20] applies $\mathcal{A}_{\text{conservative}}(<)$, where $<$ is the order in which actions are received
at a single primary site. An action aborts if it fails an application-specific precondition, which
we reify as a $\rightarrow$ constraint.

In the Last-Writer-Wins (LWW) approach [7], an action (completely overwriting some datum) is
stamped with the time it is submitted. Two actions that modify the same datum are related by $\rightarrow$ in
timestamp order. Sites execute actions in arbitrary order and apply $\mathcal{A}_{\text{conservative}}(<)$. Consequently, a
datum has the state of the most recent write (in timestamp order).

The decisions computed by the above systems are mostly arbitrary. A better way would be to
minimise aborts, or to follow user preferences, or both. This was the approach of the IceCube system
[15]. $\mathcal{A}_{IceCube}$ is an optimisation algorithm that minimises the number of dead actions in
$\mathcal{A}_{IceCube}(M)$. It does so by heuristically comparing all possible sound schedules that can
be generated from the current site-multilog. The system suggests a number of possible decisions to the
user, who states his preference.
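Algorithm 1 ClientActionsConstraints($L$)

Require: $L \subseteq A$
$K_i := K_i \cup L$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \to_{\mathcal{M}} \beta$ do
    $\to_i := \to_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \lhd_{\mathcal{M}} \beta$ do
    $\lhd_i := \lhd_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \nparallel_{\mathcal{M}} \beta$ do
    $\nparallel_i := \nparallel_i \cup \{(\alpha, \beta)\}$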



Except for LWW, which is decentralised but deterministic, the above algorithms centralise com- 
mitment at a primary site. 

To decentralise decision, one approach might be to determine a global total order $<$, using a
decentralised consensus algorithm such as Paxos [11], and apply $\mathcal{A}_{conservative(<)}$. As
above, this order is arbitrary and $\mathcal{A}_{conservative(<)}$ tends to kill actions unnecessarily.
Instead, our algorithm allows each site to propose decisions that minimise aborts and follow local
client preferences, and to reach consensus on these proposals in a decentralised manner. This is the
subject of the rest of this paper.

4 Client operation 

We now begin the discussion of our algorithm. We start with a specification of client behaviour. 

4.1 Client Behaviour and client interaction 

An application performs tentative operations by submitting actions and constraints to its local site- 
multilog; they will eventually propagate to all sites. 

We abstract application semantics by postulating that clients have access to a sound multilog
containing all the semantic constraints:
$\mathcal{M} = (A, \to_{\mathcal{M}}, \lhd_{\mathcal{M}}, \nparallel_{\mathcal{M}})$. For an example
$\mathcal{M}$, see Section 6.5.

As the client submits actions $L$ to the site-multilog, function ClientActionsConstraints
(Algorithm 1) adds constraints with respect to actions that the site already knows.⁵

To illustrate, consider Alice and Bob working together. Alice uses their shared calendar at site 1,
and Bob at site 2. Planning a meeting with Bob in Paris, Alice submits two actions: $\alpha$ = "Buy
train ticket to Paris next Monday at 10:00" and $\beta$ = "Attend meeting". As $\beta$ depends causally
on $\alpha$, $\mathcal{M}$ contains $\alpha \to_{\mathcal{M}} \beta \land \alpha \lhd_{\mathcal{M}} \beta$.
Alice calls ClientActionsConstraints($\{\alpha\}$) to add action $\alpha$ to site-multilog $M_1$, and,
some time later, similarly for $\beta$. At this point, Algorithm 1 adds the constraints
$\alpha \to \beta$ and $\alpha \lhd \beta$ taken from $\mathcal{M}$.
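Alice's two submissions can be replayed with a small Python sketch of ClientActionsConstraints. The `Multilog` class and its field names are illustrative assumptions, not part of the protocol specification; the client semantics $\mathcal{M}$ is represented as another `Multilog`.

```python
from dataclasses import dataclass, field


@dataclass
class Multilog:
    """Hypothetical in-memory multilog (K, ->, <|, non-commuting)."""
    actions: set = field(default_factory=lambda: {"INIT"})
    not_after: set = field(default_factory=set)
    enables: set = field(default_factory=set)
    non_commuting: set = field(default_factory=set)


def client_actions_constraints(site, semantics, new_actions):
    """Add new_actions to the site-multilog, then copy every constraint of
    the client semantics whose two endpoints are both known at the site."""
    site.actions |= set(new_actions)
    for name in ("not_after", "enables", "non_commuting"):
        known_pairs = {
            (a, b)
            for (a, b) in getattr(semantics, name)
            if a in site.actions and b in site.actions
        }
        getattr(site, name).update(known_pairs)


# Client semantics of the example: beta depends causally on alpha.
semantics = Multilog(
    actions={"INIT", "alpha", "beta"},
    not_after={("alpha", "beta")},
    enables={("alpha", "beta")},
)
site1 = Multilog()
client_actions_constraints(site1, semantics, {"alpha"})
after_alpha = (set(site1.not_after), set(site1.enables))
client_actions_constraints(site1, semantics, {"beta"})
```

After the first call no constraint is recorded, because $\beta$ is not yet known at site 1; both constraints appear once $\beta$ is submitted.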

5 In the pseudo-code, we leave the current time t implicit. A double-slash and sans-serif font indicates a comment, as in 
// This is a comment. 
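Algorithm 2 ReceiveAndCompare($M$)

Declare: $M = (K, \to, \lhd, \nparallel)$ a multilog received from a remote site
$M_i := M_i \cup M$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \to_{\mathcal{M}} \beta$ do
    $\to_i := \to_i \cup \{(\alpha, \beta)\}$
for all $(\alpha, \beta) \in K_i \times K_i$ such that $\alpha \nparallel_{\mathcal{M}} \beta$ do
    $\nparallel_i := \nparallel_i \cup \{(\alpha, \beta)\}$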



4.2 Multilog Propagation 

When a client adds new actions $L$ into a site-multilog, $L$ and the constraints computed by
ClientActionsConstraints form a multilog that is sent to remote sites. Upon reception, receivers merge
this multilog into their own site-multilog. By this so-called epidemic communication [3], every site
eventually receives all actions and constraints submitted at any site.

When site $i$ receives a multilog $M$, it executes function ReceiveAndCompare (Algorithm 2),
which first merges what it received into the local site-multilog. Then, if any conflicts exist between
previously-known actions and the received ones, it adds the corresponding constraints to the
site-multilog.⁶

Let us return to Alice and Bob. Suppose that Bob now adds action $\gamma$, meaning "Cancel the
meeting," to $M_2$. Action $\gamma$ is antagonistic with action $\beta$; hence,
$\beta \to_{\mathcal{M}} \gamma \land \gamma \to_{\mathcal{M}} \beta$. Some time later, site 2 sends its
site-multilog to site 1; when site 1 receives it, it runs Algorithm 2, notices the antagonism, and adds
constraint $\beta \to \gamma \land \gamma \to \beta$ to $M_1$. Thereafter, site-schedules at site 1 may
include either $\beta$ or $\gamma$, but not both.
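To see why antagonism excludes schedules that contain both actions, the sound schedules of this small multilog can be enumerated by brute force. The sketch below assumes a simplified notion of soundness (every schedule starts with INIT, $\to$ pairs respect the order when both actions are present, and $\lhd$ pairs require the enabling action to be present); the function names are illustrative.

```python
from itertools import combinations, permutations


def is_sound(schedule, not_after, enables):
    """Check a schedule (tuple of actions) against -> and <| constraints."""
    position = {action: rank for rank, action in enumerate(schedule)}
    for a, b in not_after:
        # a -> b: if both execute, a must come strictly before b.
        if a in position and b in position and position[a] >= position[b]:
            return False
    for a, b in enables:
        # a <| b: b may execute only if a executes too.
        if b in position and a not in position:
            return False
    return True


def sound_schedules(actions, not_after, enables):
    """Enumerate every sound schedule; only practical for tiny examples."""
    others = sorted(actions - {"INIT"})
    found = []
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            for ordering in permutations(subset):
                schedule = ("INIT",) + ordering
                if is_sound(schedule, not_after, enables):
                    found.append(schedule)
    return found


schedules = sound_schedules(
    {"INIT", "alpha", "beta", "gamma"},
    not_after={("alpha", "beta"), ("beta", "gamma"), ("gamma", "beta")},
    enables={("alpha", "beta")},
)
```

Under these assumptions the enumeration yields six schedules, none of which contains both `beta` and `gamma`.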

5 A decentralised commitment protocol 

Epidemic communication ensures that all site-multilogs eventually receive all information, but site- 
schedules might still differ between sites. 

For instance, let us return to Alice and Bob. Assuming users add no more actions, eventually
all site-multilogs become
$(\{\text{INIT}, \alpha, \beta, \gamma\}, \{\alpha \to \beta, \beta \to \gamma, \gamma \to \beta\}, \{\alpha \lhd \beta\}, \emptyset)$.
In this state, actions remain tentative; at time $t$, site 1 might execute
$S_1(t) = \text{INIT}; \alpha; \beta$, site 2 $S_2(t) = \text{INIT}; \alpha; \gamma$, and just INIT at
$t + 1$. A commitment protocol ensures that $\alpha$, $\beta$ and $\gamma$ eventually stabilise, and
that both Alice and Bob learn the same outcome. For instance, the protocol might add
$\beta \lhd \text{INIT}$ to $M_1$, which guarantees $\beta$, thereby both guaranteeing $\alpha$ and
killing $\gamma$. $\alpha$, $\beta$ and $\gamma$ are now decided and stable at site 1. $M_1$ eventually
propagates to other sites; and inevitably, all site-schedules eventually start with
$\text{INIT}; \alpha; \beta$, and $\gamma$ is dead everywhere.

6 ClientActionsConstraints provides constraints between successive actions submitted at the same site. These con- 
sist typically of dependence and atomicity constraints. In contrast, ReceiveAndCompare computes constraints between 
independently-submitted actions. 
5.1 Overview

Our key insight is that eventual consistency is equivalent to the property that the site-multilogs of 
all sites share a common well-formed prefix (defined hereafter) of stable actions, which grows to 
include every action eventually. Commitment serves to agree on an extension of this prefix. As 
clients continue to make optimistic progress beyond this prefix, the commitment protocol can run 
asynchronously in the background. 

In our protocol, different sites run instances of $\mathcal{A}$ to make proposals; a proposal is a
tentative well-formed prefix of its site-multilog. Sites agree via a decentralised election. This works
even if $\mathcal{A}$ is non-deterministic, or if sites use different $\mathcal{A}$ algorithms. We
recommend IceCube [15], but any algorithm satisfying the requirements of Section 3 is suitable.

In what follows, $i$ represents the current site, and $j$, $k$ range over $J$.

We distinguish two roles at each site, proposers and acceptors. Each proposer has a fixed weight,
such that $\sum_{k \in J} weight_k = 1$. In practice, we expect only a small number of sites to have
non-zero weights (in the limit, one site might have weight 1; this is a primary site as in Section 3),
but the safety of our protocol does not depend on how weights are allocated. To simplify exposition,
weights are distributed ahead of time and do not change; it is relatively straightforward to extend the
current algorithm to allow weights to vary between successive elections.

An acceptor at some site computes the outcome of an election, and inserts the corresponding 
decision constraints into the local site-multilog. 

Each site stores the most recent proposal received from each proposer in array $proposals_i$, of
size $n$ (the number of sites). To keep track of proposals, each entry $proposals_i[k]$ carries a
logical timestamp, noted $proposals_i[k].ts$. Timestamping ensures the liveness of the election process
even though links between nodes are not necessarily FIFO.

Each site performs Algorithm 3. First, it initialises the site-multilog and proposals data structures;
then it runs a number of parallel iterative threads, detailed in the next sections. Within a thread,
an iteration is atomic. Iterations are separated by arbitrary amounts of time.

5.2 Epidemic communication 

The first two threads (lines 3-10) exchange multilogs and proposals between sites. Function
ReceiveAndCompare (defined in Algorithm 2, Section 4.2) compares newly received actions to
already-known ones, in order to compute conflict constraints. In Algorithm 6, a receiver updates its
own set of proposals with any more recent ones.
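The timestamp comparison performed on reception can be sketched as follows. The representation is an assumption made for illustration: each array entry is a `(proposal, timestamp)` pair, and proposals are opaque values.

```python
def merge_proposals(local, received):
    """Keep, for each site k, whichever proposal carries the larger timestamp.

    local, received: lists of (proposal, timestamp) pairs indexed by site.
    The local list is updated in place and returned.
    """
    for k, (proposal, timestamp) in enumerate(received):
        if local[k][1] < timestamp:
            local[k] = (proposal, timestamp)
    return local


# A stale entry for site 0 arrives together with a fresher entry for site 1.
local_view = [("p0-v2", 2), ("p1-v1", 1)]
incoming = [("p0-v1", 1), ("p1-v3", 3)]
merged = merge_proposals(local_view, incoming)
```

Because an entry is replaced only when the incoming timestamp is strictly larger, a message that was overtaken in transit cannot roll a proposal back to an older version.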

5.3 Client, local state, proposer 

The third thread (lines 12-14) constitutes one half of the client. An application submits tentative
operations to its local site-multilog, which the site-schedule will (hopefully) execute in the fourth
thread. Constraints relating new actions to previous ones are included at this stage by function
ClientActionsConstraints (defined in Algorithm 1).

The other half of the client is function ReceiveAndCompare (Algorithm 2), invoked in the second
thread (line 9).
Algorithm 3 Algorithm at site $i$

Declare: $M_i$: local site-multilog
Declare: $proposals_i[n]$: array of proposals, indexed by site; a proposal is a multilog
 1: $M_i := (\{\text{INIT}\}, \emptyset, \emptyset, \emptyset)$
 2: $proposals_i := [((\{\text{INIT}\}, \emptyset, \emptyset, \emptyset), 0), \ldots, ((\{\text{INIT}\}, \emptyset, \emptyset, \emptyset), 0)]$
 3: loop // Epidemic transmission
 4:     Choose $j \neq i$
 5:     Send copy of $M_i$ and $proposals_i$ to $j$
 6: ||
 7: loop // Epidemic reception
 8:     Receive multilog $M$ and proposals $P$ from some site $j \neq i$
 9:     ReceiveAndCompare($M$) // Compute conflict constraints
10:     MergeProposals($P$)
11: ||
12: loop // Client submits
13:     Choose $L \subseteq A$
14:     ClientActionsConstraints($L$) // Submit actions, compute local constraints
15: ||
16: loop // Compute current local state
17:     Choose $S_i \in \Sigma(M_i)$
18:     Execute $S_i$
19: ||
20: loop // Proposer
21:     UpdateProposal // Suppress redundant parts
22:     $proposals_i[i] := \mathcal{A}(M_i \cup proposals_i[i])$ // New proposal, keeping previous
23:     Increment $proposals_i[i].ts$
24: ||
25: loop // Acceptor
26:     Elect



The fourth thread (lines 16-18) computes the current tentative state by executing some sound
site-schedule.

The fifth thread (lines 20-23) computes proposals by invoking $\mathcal{A}$. A proposal extends the
current site-multilog with proposed decisions. A proposer may not retract a proposal that was already
received by some other site. Passing argument $M_i \cup proposals_i[i]$ to $\mathcal{A}$ ensures that
these two conditions are satisfied.

However, once a candidate has either won or lost an election, it becomes redundant; UpdateProposal
(Algorithm 4) removes it from the proposal.

The last thread is described in the next section. 
Algorithm 4 UpdateProposal
Let $P = (K_P, \to_P, \lhd_P, \nparallel_P) = proposals_i[i]$
$K_P := K_P \setminus Decided(M_i)$
$\to_P := \to_P \cap (K_P \times K_P)$
$\lhd_P := \lhd_P \cap (K_P \times K_P)$
$\nparallel_P := \emptyset$
$proposals_i[i] := P$



5.4 Election 

The last thread (lines 25-26) conducts elections. Several elections may be taking place at any point in
time. An acceptor is capable of determining locally the outcome of elections. A proposal can be
decomposed into a set of eligible candidates.

5.4.1 Eligible candidates 

A candidate cannot be just any subset of a proposal. Consider, for instance, proposal
$P = (\{\text{INIT}, \alpha, \gamma\}, \{\alpha \to \gamma, \gamma \to \alpha, \alpha \to \alpha\}, \{\gamma \lhd \text{INIT}\}, \emptyset)$,
and some candidate $X$ extracted from $P$. If $X$ could contain $\gamma$ and not $\alpha$, then we might
guarantee $\gamma$ without killing $\alpha$, which would be incorrect. According to this intuition, $X$
must be a well-formed prefix of $P$:

Definition 2 (Well-formed prefix.) Let $M = (K, \to, \lhd, \nparallel)$ and
$M' = (K', \to', \lhd', \nparallel')$ be two multilogs. $M'$ is a well-formed prefix of $M$, noted
$M' \overset{wf}{\sqsubset} M$, if (i) it is a subset of $M$, (ii) it is stable, (iii) it is left-closed
for its actions, and (iv) it is closed for its constraints.



A well-formed prefix is a semantically-meaningful unit of proposal. For instance, if a $\to$ or $\lhd$
cycle is present in $M$, every well-formed prefix either includes the whole cycle, or none of its
actions.

Unfortunately, because of concurrency and asynchronous communication, it is possible that
some sites know of a $\to$ cycle and not others; or, more embarrassingly, that sites know only parts
of a cycle. Therefore we also require the following property:

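Definition 3 (Eligible candidates.) An action is eligible in set $L$ if all its predecessors by client
NotAfter, Enables and NonCommuting relations are in $L$. A candidate multilog $M$ is eligible if all
actions in $K$ are eligible in $K$:
$eligible(M) \overset{def}{=} \forall (\alpha, \beta) \in A \times K,\ (\alpha \to_{\mathcal{M}} \beta \lor \alpha \nparallel_{\mathcal{M}} \beta \lor \alpha \lhd_{\mathcal{M}} \beta) \Rightarrow \alpha \in K$.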
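The eligibility predicate can be written directly from its definition. The sketch below assumes that the client relations of $\mathcal{M}$ are available as sets of pairs, which is a global view that a real acceptor does not have; the function name and argument layout are illustrative.

```python
def eligible(candidate_actions, *client_relations):
    """True if every predecessor (by any client relation) of an action of the
    candidate is itself part of the candidate."""
    for relation in client_relations:
        for predecessor, action in relation:
            if action in candidate_actions and predecessor not in candidate_actions:
                return False
    return True


# Client relations of the Alice and Bob example (no non-commuting pairs).
NOT_AFTER = {("alpha", "beta"), ("beta", "gamma"), ("gamma", "beta")}
ENABLES = {("alpha", "beta")}

only_alpha = eligible({"INIT", "alpha"}, NOT_AFTER, ENABLES)
half_cycle = eligible({"INIT", "alpha", "beta"}, NOT_AFTER, ENABLES)
whole_cycle = eligible({"INIT", "alpha", "beta", "gamma"}, NOT_AFTER, ENABLES)
```

A candidate holding `beta` without `gamma` is rejected because `gamma` precedes `beta` in the antagonism cycle; the candidate becomes eligible only once the whole cycle is included.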



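$$
M' \overset{wf}{\sqsubset} M \;\overset{def}{=}\;
\begin{cases}
M' \subseteq M \\
K' = Stable(M') \\
\forall \alpha, \beta \in A,\ \beta \in K' \Rightarrow
  \begin{cases}
  \alpha \lhd \beta \Rightarrow \alpha \lhd' \beta \\
  \alpha \to \beta \Rightarrow \alpha \to' \beta
  \end{cases} \\
\forall \alpha, \beta \in A,\ (\alpha \to' \beta \lor \alpha \lhd' \beta \lor \alpha \nparallel' \beta) \Rightarrow \alpha, \beta \in K'
\end{cases}
$$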
To compute eligibility precisely would require local access to the distributed state, which is
impossible. Therefore acceptors must compute a safe approximation (i.e., false negatives are
allowed) of eligibility. For instance, in the database example, a sufficient condition for transaction
$T$ to be eligible at site $i$ is that all transactions submitted (at any site) concurrently with $T$
are also known at site $i$. Indeed, all such transactions have gone through either
ClientActionsConstraints or ReceiveAndCompare; hence, according to Table 1, $T$ is eligible.

5.4.2 Computation of votes 

We define a vote as a pair (weight, siteId). The comparison operator for votes breaks ties by
comparing site identifiers:
$(w, i) > (w', i') \overset{def}{=} w > w' \lor (w = w' \land i > i')$. Therefore, votes add up as
follows: $(w, i) + (w', i') \overset{def}{=} (w + w', \max(i, i'))$. Candidates are compatible if their
union is sound: $compatible(M, M') \overset{def}{=} \Sigma(M \cup M') \neq \emptyset$. The votes of
compatible candidates add up; $tally(X)$ computes the total vote for some candidate $X$:

$$tally(X) \overset{def}{=} \sum_{k \,:\, X \overset{wf}{\sqsubset} proposals_i[k]} (weight_k, k)$$
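As a small illustration of vote arithmetic, votes can be modelled as Python tuples `(weight, site_id)`: tuple comparison in Python is lexicographic, which coincides with the tie-breaking order defined above. The helper names, the use of `Fraction`, and the assumption that site identifiers are non-negative integers are choices made for this sketch.

```python
from fractions import Fraction
from functools import reduce


def add_votes(left, right):
    """(w, i) + (w', i') = (w + w', max(i, i'))."""
    return (left[0] + right[0], max(left[1], right[1]))


def total(votes):
    """Sum a collection of votes; the empty sum is (0, 0)."""
    return reduce(add_votes, votes, (Fraction(0), 0))


third = Fraction(1, 3)
two_supporters = total([(third, 1), (third, 3)])
tie_broken_by_site = (third, 2) > (third, 1)
weight_dominates = (Fraction(2, 3), 1) > (third, 3)
```

Exact fractions avoid rounding surprises when weights such as one third are added and compared.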

An election pits some candidate against comparable candidates from all other sites. Two multilogs
are comparable if they contain the same set of actions:
$comparable(M, M') \overset{def}{=} K = K'$. The direct opponents of candidate $X$ in some election are
comparable candidates that $X$ does not prefix:

$$opponents(X) \overset{def}{=} \{B \mid \exists k : B \overset{wf}{\sqsubset} proposals_i[k] \land comparable(B, X) \land X \not\overset{wf}{\sqsubset} B\}$$

However, we must also count missing votes, i.e., the weights of sites whose proposals do not yet
include all actions in $X$. Function $cotally(X)$ adds these up:

$$cotally(X) \overset{def}{=} \sum_{k \,:\, K_X \not\subseteq K_{proposals_i[k]}} (weight_k, k)$$

Algorithm 5 depicts the election algorithm. A candidate is a well-formed prefix of some proposal.
We ignore already-elected candidates and we only consider eligible ones. A candidate wins
its election if its tally is greater than the tally of any direct opponent, plus its cotally. Note that,
as proposals are received, cotally tends towards 0; therefore some candidate is eventually elected. We
merge the winner into the site-multilog.

5.5 Example 

We return to our example. Recall that, once Alice and Bob have submitted their actions, and
site 1 and site 2 have exchanged site-multilogs, both site-multilogs are equal to
$(\{\text{INIT}, \alpha, \beta, \gamma\}, \{\alpha \to \beta, \beta \to \gamma, \gamma \to \beta\}, \{\alpha \lhd \beta\}, \emptyset)$.
Now Alice (site 1) proposes to guarantee $\alpha$ and $\beta$, and to kill $\gamma$:
$proposals_1[1] = M_1 \cup \{\beta \lhd \text{INIT}\}$. In the meanwhile, Bob at site 2 proposes to
guarantee $\gamma$ and
Algorithm 5 Elect
Let $X$ be a multilog such that:
        $\exists k \in J : X \overset{wf}{\sqsubset} proposals_i[k]$
    $\land\ X \not\subseteq M_i$
    $\land\ eligible(X)$
    $\land\ tally(X) > \max_{B \in opponents(X)} (tally(B)) + cotally(X)$
if such an $X$ exists then
    Choose such an $X$
    $M_i := M_i \cup X$



Algorithm 6 MergeProposals($P$)
for all $k$ do
    if $proposals_i[k].ts < P[k].ts$ then
        $proposals_i[k] := P[k]$
        $proposals_i[k].ts := P[k].ts$



$\alpha$, and to kill $\beta$: $proposals_2[2] = M_2 \cup \{\gamma \lhd \text{INIT}, \alpha \lhd \text{INIT}\}$.
These proposals are incompatible; therefore the commitment protocol will eventually agree on at most
one of them.

Consider now a third site, site 3; assume that the three sites have equal weight $\frac{1}{3}$. Imagine
that site 3 receives site 2's site-multilog and proposal, and sends its own proposal that is identical
to site 1's. Sometime later, site 3 sends its proposal to site 1. At this point, site 1 has received
all sites' proposals. Now site 1 might run an election, considering a candidate $X$ equal to
$proposals_1[1]$. $X$ is indeed a well-formed prefix of $proposals_1[1]$; now suppose that $X$ is
eligible as all sites have voted on $K_X$; $tally(X) = \frac{2}{3}$ is greater than that of $X$'s only
opponent ($tally(proposals_1[2]) = \frac{1}{3}$); and $cotally(X) = 0$.

Therefore, site 1 elects $X$ and merges $X$ into $M_1$. Any other site will either elect $X$ (or some
compatible candidate) or become aware of its election by epidemic transmission of $M_1$.
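The numbers of this example can be checked with a deliberately simplified Python sketch of the winning condition. The simplifications are assumptions of the sketch, not of the protocol: a proposal is a pair (set of actions, set of decision constraints), a site supports a candidate only if its whole proposal equals it (instead of the well-formed prefix test), eligibility is taken for granted, and the site-identifier tie-break is ignored.

```python
from fractions import Fraction


def weight_where(proposals, weights, predicate):
    """Sum the weights of the sites whose stored proposal satisfies predicate."""
    return sum((weights[k] for k, p in proposals.items() if predicate(p)), Fraction(0))


def wins(candidate, proposals, weights):
    """tally(X) > max over opponents of tally(B), plus cotally(X)."""
    actions, _decisions = candidate
    tally = weight_where(proposals, weights, lambda p: p == candidate)
    # Sites whose proposal does not yet cover all actions of the candidate.
    cotally = weight_where(proposals, weights, lambda p: not actions <= p[0])
    rivals = {p for p in proposals.values() if p[0] == actions and p != candidate}
    best_rival = max(
        (weight_where(proposals, weights, lambda p, r=rival: p == r) for rival in rivals),
        default=Fraction(0),
    )
    return tally > best_rival + cotally


K = frozenset({"INIT", "alpha", "beta", "gamma"})
alice = (K, frozenset({("beta", "INIT")}))                      # beta <| INIT
bob = (K, frozenset({("gamma", "INIT"), ("alpha", "INIT")}))    # gamma, alpha <| INIT
initial = (frozenset({"INIT"}), frozenset())
weights = {k: Fraction(1, 3) for k in (1, 2, 3)}

before_site3 = {1: alice, 2: bob, 3: initial}   # site 3 has not proposed yet
after_site3 = {1: alice, 2: bob, 3: alice}      # site 3 sides with site 1
```

Before site 3's proposal is known, its weight counts in the cotally and nobody can win; once it arrives, Alice's candidate holds two thirds of the weight against one third.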

6 Discussion 

6.1 Safety proof outline 

Section 2 states our safety property, the conjunction of mergeability and local soundness. Clearly
Algorithm 3 satisfies local soundness; see lines 16-18. We now outline a proof of mergeability.
We say that candidate $X$ is elected in a run $r$ at time $t$, if some acceptor $i$ executes
Algorithm 5 in $r$ at $t$, and elects a candidate $Y$ such that $X \overset{wf}{\sqsubset} Y$. Given a
run $r$ of Algorithm 3, we note $Elected(r, t)$ the set of candidates elected in $r$ up to time $t$
(inclusive), and $Elected(r)$ the set of candidates elected during $r$. Observe that, since
$\mathcal{M}$ is sound, Algorithm 3 satisfies mergeability in a run $r$ if and only if the acceptors
elect a sound set of candidates during $r$ ($\bigcup_{X \in Elected(r)} X$ is sound).
Suppose, by contradiction, that during run $r$, this set is unsound. As $\mathcal{M}$ is sound, by
$\mathcal{A}$ candidates are sound. Consequently there must exist an unsound set of candidates
$C \subseteq Elected(r)$. Let us now consider the following property:

Definition 4 (Minimality.) A multilog $M$ is said minimal iff:
$\forall M' \subseteq M,\ M' \overset{wf}{\sqsubset} M \Rightarrow M' = M$.

As candidates are eligible, there must exist two candidates X and X' in C such that: (i) X and X' are 
non-compatible, and (ii) X and X' are minimal. 

We define the following notation. Let $i$ (resp. $i'$) be the acceptor that elects $X$ (resp. $X'$) in
$r$. $t$ is the time at which $i$ elects $X$ in $r$ (resp. $t'$ for $X'$ on $i'$). For a proposer $k$,
$t_k$ (resp. $t'_k$) is the time at which it sent $proposals_i[k](t)$ to $i$ (resp.
$proposals_{i'}[k](t')$ to $i'$). $Q$ (resp. $Q'$) is the set of proposers that vote for $X$ at $t$ on
$i$ (resp. for $X'$ at $t'$ on $i'$); formally
$Q = \{k \mid X \overset{wf}{\sqsubset} proposals_i[k](t)\}$ and
$Q' = \{k \mid X' \overset{wf}{\sqsubset} proposals_{i'}[k](t')\}$.

Hereafter, and without loss of generality, we suppose that: (i) $t < t'$, (ii) $X$ is the first
candidate non-compatible with $X'$ elected in $r$, and (iii) $Elected(r, t' - 1)$ is sound.

Since $i'$ elects $X'$ at $t'$, at that time on site $i'$:

$$tally(X') > \max_{B \in opponents(X')} (tally(B)) + cotally(X') \qquad (1)$$

Equation 1 defines an upper bound for $tally(X)$ on $i$ at $t$, as follows. Consider some $k \in Q$.

If $t_k < t'_k$ then from Algorithm 4 and the fact that $Elected(r, t' - 1)$ is sound, we know that
$X \overset{wf}{\sqsubset} proposals_{i'}[k](t')$.

If now $t_k > t'_k$, then as $tally(X')$, $opponents(X')$ and $cotally(X')$ define a partition of $J$,
either:

1. $k$ has not yet voted on $K_{X'}$ at $t'$ on $i'$ and its weight is counted in $cotally(X')$.

2. Or, if its vote already includes $K_{X'}$, it is counted in $opponents(X')$ as $X$ is the first
candidate non-compatible with $X'$ elected in $r$, $X \overset{wf}{\sqsubset} proposals_i[k](t)$, and
$\neg compatible(X, X')$.

From these reasonings (if $t_k < t'_k$ and if $t'_k < t_k$), and Equation 1, we derive:

$$tally_{i'}(X')(t') > tally_i(X)(t) \qquad (2)$$

where $tally_k(Z)(\tau)$ means the value of $tally(Z)$ computed at time $\tau$ on site $k$.
Now consider some $k \in Q'$.

If $t_k > t'_k$ then, $X$ being the first candidate non-compatible with $X'$ elected in $r$, from
Algorithm 4 we have $X' \overset{wf}{\sqsubset} proposals_i[k](t)$.

If $t_k < t'_k$, now either

1. $X' \overset{wf}{\sqsubset} proposals_i[k](t)$

2. or $k$ has not yet voted on $K_{X'}$ on $i$ at $t$.

The reasoning here is similar to $k \in Q$: we use the minimality of $X$ and $X'$, the fact that they
are non-compatible, and that $X$ is the first candidate non-compatible with $X'$ elected in $r$.
From the above, it follows that:

$$tally_{i'}(X')(t') \le tally_i(X')(t) + cotally_i(X)(t) \qquad (3)$$

Now, combining Equations 2 and 3, we conclude that, at site $i$ at time $t$:

$$tally(X) < \max_{B \in opponents(X)} (tally(B)) + cotally(X) \qquad (4)$$

$X$ cannot be elected on $i$ at $t$. Contradiction.

6.2 Time complexity to run an election 

Let M be a site-multilog, and let m be the number of actions in M. We first extract from proposals 
the set of candidates as follows: 

1. For every proposal $P \in proposals$, for every action $\alpha \in P$, we compute the list of
predecessors by $\to$, $\lhd$ and $\nparallel$ of $\alpha$ in $P$.

2. Let $l$ be such a list; we then compute $d = l \cap Dead(P)$ for every $P$.

3. Then for any couple $(\alpha, \beta) \in l$ such that $\alpha \nparallel \beta \in \nparallel_P$, we
save the serialization decision: either $\alpha \to \beta \in \to_P$ or $\beta \to \alpha \in \to_P$. It
forms a set of couples $s$, containing at most $\frac{1}{2}(m^2 - m)$ elements.⁷

A candidate is any tuple $X = (l, d, s)$. According to items 1, 2 and 3, the time complexity to extract
all the candidates in $proposals$ is at most $O(nm)$, since all operations can be performed
simultaneously.

We compute $cotally(X)$ by comparing $l$ to $P.K$ for any $P \in proposals$: $O(mn)$ operations.
Finally, we divide the remaining proposals into $tally(X)$ and $opponents(X)$ by comparing $Dead(P)$ and
$\nparallel_P$ to $d$ and $s$: $O(n(m + s))$ operations.

Since $\nparallel$ is symmetric, there can exist at most $O(n(m - \sqrt{2s}))$ candidates. Thus we have
to consider the maximum of the function $(m + s)(m - \sqrt{2s})$. It follows that the maximum is
attained for $s$ of order $m^2$, and that the time complexity of the whole election process is
$O(m^3 n^2)$.

6.3 Message cost 

Interestingly, the message cost of our protocol varies with application semantics, along two dimen- 
sions. 

First, the degree of semantic complexity, i.e., the complexity of the client constraint graph
$\mathcal{M}$, influences the number of votes required. To illustrate, consider an application where all
actions are mutually independent, i.e., $\mathcal{M}$ contains no constraints. Then, all actions commute
with one another,

7 It equals the maximum number of edges in a strongly connected graph of size m 
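Table 1: $\mathcal{M}_{SER-DB-after}$: Constraints for a serialisable database that transmits after-values
Column headings: $T \prec T'$; $T \parallel T'$; $T' \prec T$
Row headings: $RS(T) \cap WS(T') \neq \emptyset$; $WS(T) \cap WS(T') \neq \emptyset$
Cell entries, in source order: $T \to T'$; $T' \to T \land T' \lhd T$; $T' \to T$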



and no action ever needs to be killed. Every candidate is trivially eligible, and trivially compatible
with all other candidates.

Second, call degree of optimism $d$ the size of a batch, i.e., the number of actions that a site may
execute tentatively before requiring commitment. This measures both how far replicas relax consistency
and how many concurrent commutative actions clients propose to the same replica. It takes a chain of
$\frac{n}{2}$ messages to construct a majority. A candidate may contain up to $d$ actions. Therefore,
the amortised message cost to commit an action is $\frac{n}{2} \times \frac{1}{d}$.

A more detailed evaluation of message cost is left for future work. 



6.4 Implementation considerations 

Our pseudo-code was written for clarity, not efficiency. Many optimisations are possible. For
instance, a site $i$ does not need to send the whole $proposals_i[i]$. When sending to $j$, it suffices
to send the difference $proposals_i[i] \setminus proposals_j[j]$.

Conceptually, a multilog grows without bound. However, a stable action, and all its constraints,
can safely be deleted.

Conceptually, our algorithm executes all actions everywhere. A practical implementation only
needs to achieve an equivalent state; in particular, actions that do not have side-effects do not have
to be replayed. For instance, in a database application, read operations do not need to be replayed.⁸



6.5 Example application 

We illustrate the application of our algorithm to a replicated database. The semantic constraints
between two transactions depend on several factors: (i) whether the transactions are related by
happens-before or are concurrent; (ii) whether their read- and write-sets intersect or not; (iii) what
consistency criterion is being enforced (for instance, constraints differ between serializability and
snapshot isolation); (iv) how, after executing a transaction on some initial site, the system
replicates its effects at a remote site: by replaying the transaction, or by applying the after-values
computed at the initial site.

8 Formally, we need to generalise the equivalence relation between schedules, which currently is based
only on $\nparallel$ [18]. The definition of consistency now becomes that every pair of sites
eventually converges to schedules that are equivalent according to the new relation.
Table 1 exhibits semantic constraints between transactions, where (a) the system replicates a
transaction by writing its after-values, and (b) transactions are strictly serialisable.⁹ Supporting a
different semantics, e.g., (a') replaying actions, or (b') SI, requires only some small changes to the
table.

7 Related work 

In previous OR systems, commitment was often either centralised at a primary site [15, 20] or
oblivious of semantics [7, 17]. It is very difficult to combine decentralisation with semantics.

Our election algorithm is inspired by Keleher's Deno system [8], a pessimistic system, which
performs a discrete sequence of elections. Keleher proposes plurality voting to ensure progress
when none of multiple competing proposals gains a majority. The VVWV protocol of Barreto and
Ferreira generalizes Deno's voting procedure, enabling continuous voting [1].

The only semantics supported by Deno or VVWV is to enforce Lamport's happens-before relation [10];
all actions are assumed to be mutually non-commuting. Happens-before captures potential
causality; however, an event may happen-before another even if they are not truly dependent. This
paper further generalizes VVWV by considering semantic constraints.

Holliday et al. depict a family of epidemic algorithms to ensure serializability in replicated
database systems [5]. The three algorithms consider that concurrent conflicting transactions are
antagonistic. Two of them abort concurrent conflicting transactions, and the last one (quorum-based)
can only commit one transaction among a set of concurrent conflicting ones. Our algorithm considers
that concurrent conflicting transactions are not necessarily antagonistic: it tries to optimize the
number of committed transactions by computing best-effort proposals and electing them with plurality.

ESDS [4] is a decentralised replication protocol that supports some semantics. It allows users to
create an arbitrary causal dependence graph between actions. ESDS eventually computes a global
total order among actions, but also includes an optimisation for the case where some action pairs
commute. ESDS does not consider atomicity or antagonism relations, nor does it consider dead
actions.

Bayou [20] supports arbitrary application semantics. User-supplied code controls whether an
action is committed or aborted. However, the system imposes an arbitrary total execution order.
Bayou centralises decision at a single primary replica.

IceCube [9] introduced the idea of reifying semantics with constraints. The IceCube algorithm
computes optimal proposals, minimizing the number of dead actions. Like Bayou, commitment in
IceCube is centralised at a primary. Compared to this article, IceCube supports a richer constraint
vocabulary, which is useful for applications, but harder to reason about formally.

The Paxos distributed protocol [11] computes a total order. Such a total order may be used to
implement state-machine replication [10], whereby all sites execute exactly the same schedule. Such
a total order over all actions is necessary only if all actions are mutually non-commuting. In
Section 3 we showed how to combine semantic constraints with a total order, but this approach is clearly
9 $T \prec T'$ denotes $T$ happens-before $T'$ [10]. $T \parallel T'$ denotes concurrency, i.e., neither
$T \prec T'$, nor $T' \prec T$. $RS(T)$ and $WS(T)$ denote $T$'s read set and write set respectively.
sub-optimal. However, Paxos remains live even if $f < \frac{n}{2}$ sites crash forever, whereas the
other systems described here (including ours) block if a site crashes forever. We assume that a site
stores its multilogs and its proposals in persistent memory, and that after a crash it recovers with its
identity and persistent store intact. This is a fairly reasonable assumption in a well-managed
cooperative system. (For instance, each site might actually be implemented as a cluster on a LAN, with
redundant storage, and strong consistency internally.)

Generalized Paxos [12] and Generic Broadcast [14] take commutativity relations into account
and compute a partial order. They do not consider any other semantic relations. Both Generalized
Paxos [12] and our algorithm make progress when a majority is not reached, although through
different means. Generalized Paxos starts a new election instance, whereas our algorithm waits for a
plurality decision.

8 Conclusion and future work 

The focus of our study is cooperative applications with rich semantics. Previous approaches to 
replication did not support a sufficiently rich repertoire of semantics, or relied on a centralized point 
of commitment. They often impose a total order, which is stronger than necessary. 

In contrast, we propose a decentralized commitment protocol for semantically-rich systems. Our 
approach is to reify semantic relations as constraints, which restrict the scheduling behavior of the 
system. According to our formal definition of consistency, the system has an obligation to resolve 
conflicts, and to eventually execute equivalent stable schedules at all sites. 

Our protocol is safe in the absence of Byzantine faults, and live in the absence of crashes. It 
uses voting to avoid any centralization bottleneck, and to ensure that the result is similar to local 
proposals. It uses plurality voting to make progress even when an election does not reach a majority. 

There is an interesting trade-off in the proposal/voting procedure. The system might decide fre- 
quently, in small increments, so that users quickly know whether their tentative actions are accepted 
or rejected. However this might be non-optimal as it may cut off interesting future behaviors. Or 
it may base its decisions on a large batch of tentative actions, deciding less frequently. This im- 
poses more uncertainty on users, but decisions may be closer to the optimum. We plan to study this 
trade-off in our future work. 

Another future direction is partial replication. In such a system, a site receives only the actions 
relative to the objects it replicates (and their constraints). A site votes only on the actions it knows. 
Because constraints might relate actions known only by distinct sites, these sites must agree together; 
however we expect that global agreement is rarely necessary. By exploiting knowledge of semantic 
constraints, we hope to limit the scope of a commitment protocol to small-scale agreements, instead 
of a global consensus. 